Diskless management
===================

Image generation
----------------

Thanks to Ocean Stack's architecture, diskless images are simply virtual machine images that are exported through iSCSI. See :ref:`diskless blueprint` for details about the diskless architecture. Here, we only document the image generation procedure; the configuration management of the compute nodes is out of scope.

First, to generate a diskless image, use the :ref:`Add a new service VM` procedure to add a new VM that will hold our compute image. This reference VM will be designated as ``COMPUTE_VM``.

The procedure for generating a complete image is the following:

* If present, back up the old reference image:

  .. code-block:: shell

     mv /volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2 /volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2.$(date +%F)

* Create a new reference image file:

  .. code-block:: console

     # prepare-ocean-image.sh ${COMPUTE_VM}
     Formatting '/volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2', fmt=qcow2 size=53687091200 backing_file='/volspoms1/pcocc/persistent_drives/rhel.latest.qcow2' encryption=off cluster_size=65536 lazy_refcounts=off

* On a working hypervisor, launch the reference VM:

  .. code-block:: console

     # . /etc/sysconfig/pcocc-vm-${COMPUTE_VM}
     # pcocc alloc ${COMPUTE_VM}

* Follow the bootstrap process using the **pcocc** CLI:

  .. code-block:: console

     (pcocc/XXXXX) # pcocc console
     [...]
     [ 48.578821] cloud-init[902]: + cloud-init-per instance distro_sync yum distribution-synchronization -y
     [ 48.822081] cloud-init[902]: Loaded plugins: priorities, search-disabled-repos
     [ 53.809157] cloud-init[902]: 437 packages excluded due to repository priority protections
     [ 58.596408] cloud-init[902]: Resolving Dependencies
     [ 58.598198] cloud-init[902]: --> Running transaction check
     [ 58.599632] cloud-init[902]: ---> Package bind-libs-lite.x86_64 32:9.9.4-74.el7_6.1 will be updated
     [ 58.854444] cloud-init[902]: ---> Package bind-libs-lite.x86_64 32:9.9.4-74.el7_6.2 will be an update
     [ 58.873073] cloud-init[902]: ---> Package bind-license.noarch 32:9.9.4-74.el7_6.1 will be updated
     [...]

  .. note::
     The VM might not be reachable immediately: there is a delay between the boot and the effective configuration of the SSH daemon. Be patient and check the console output for any error that could prevent SSH from listening correctly.

* Poll the VM for **cloud-init** completion. If the ``/run/cloud-init/result.json`` file is present, the **cloud-init** process is complete:

  .. code-block:: console

     (pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 cat /run/cloud-init/result.json
     {
       "v1": {
         "datasource": "DataSourceNoCloud [seed=/dev/sr0][dsmode=net]",
         "errors": [
           "('users-groups', TypeError(\"Can not create sudoers rule addition with type u'bool'\",))",
           "('scripts-user', RuntimeError('Runparts: 1 failures in 1 attempted commands',))"
         ]
       }

* Do a first *sanity* reboot, to make sure that the correct kernel is booted:

  .. code-block:: console

     (pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 reboot

* The VM may boot using the DisklessTrap_ initramfs image. To jump out of the trap, exit the shell present on the console:

  .. code-block:: console

     (pcocc/XXXXX) # pcocc console
     root@${COMPUTE_VM}_DisklessTrap:/root# exit
     [...]

* Apply, once again, a puppet run to make sure that kernel-related changes are correctly applied to the current kernel:

  .. code-block:: console

     (pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 puppet-apply
     [...]
     Notice: Applied catalog in 121.63 seconds
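* Optionally, confirm which kernel the VM is currently running and which kernel packages are installed before rebuilding the initramfs. This is a quick sanity check, not part of the original procedure; it simply reuses the same jump-host access as the previous steps:

  .. code-block:: console

     (pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 uname -r
     (pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 rpm -q kernel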
* Rebuild the initramfs, then reboot:

  .. code-block:: console

     (pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0
     # dracut -fMv
     [...]
     *** Creating initramfs image file '/boot/initramfs-3.10.0-957.35.2.el7.x86_64.img' done ***
     # reboot

* Again, the VM will boot using the DisklessTrap_ initramfs image. To jump out of the trap, exit the shell present on the console:

  .. code-block:: console

     (pcocc/XXXXX) # pcocc console
     root@${COMPUTE_VM}_DisklessTrap:/root# exit
     [...]

* Extract the initramfs and vmlinuz files from the image:

  .. code-block:: console

     (pcocc/XXXXX) # pcocc ssh -p 422 vm0 "tar -C /boot -czO initramfs-$(uname -r).img vmlinuz-$(uname -r)" | tar -C /volspoms1/pub/boot/diskless/ --transform 's/$/.new/' -xzf -

* Shut down the VM:

  .. code-block:: console

     (pcocc/XXXXX) # ^D
     Terminating the cluster...

.. _image_variables:

* Define the destination image and key:

  .. code-block:: shell

     export KEY=/volspoms1/diskless/keys/stacker-image-$(date +%F).key
     export RAW_IMG=/volspoms1/diskless/images/raw/stacker-image-$(date +%F).raw
     export ENC_IMG=/volspoms1/diskless/images/encrypted/stacker-image-$(date +%F).img

* Copy the qcow2 image into a raw image using **qemu-img**:

  .. code-block:: console

     # qemu-img convert -f qcow2 -O raw gluster://top1/volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2 gluster://top1${RAW_IMG}
     [2020-01-14 14:43:29.803547] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-1: All subvolumes are down. Going offline until atleast one of them comes back up.
     [2020-01-14 14:43:29.803936] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-2: All subvolumes are down. Going offline until atleast one of them comes back up.
     [2020-01-14 14:43:29.804297] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-3: All subvolumes are down. Going offline until atleast one of them comes back up.
     [2020-01-14 14:43:29.804641] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-4: All subvolumes are down. Going offline until atleast one of them comes back up.
     [2020-01-14 14:43:29.804978] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-5: All subvolumes are down. Going offline until atleast one of them comes back up.
     [...]

* Finally, encrypt (while copying) the image:

  .. code-block:: console

     # stacker lio encrypt -k ${KEY} -s ${RAW_IMG} -d ${ENC_IMG}

  .. note::
     *Stacker* will create ``${KEY}`` and encrypt ``${RAW_IMG}`` with it.

Exporting image to compute nodes
--------------------------------

Once the diskless image is generated, export it to the nodes.

* Define the nodes:

  .. code-block:: shell

     export COMPUTE_NODES=ocean[1-1000]

* Export the previously generated image:

  .. code-block:: shell

     IMG_NAME="compute_img-$(date +%F)"
     clush -S -bw iscsi_srv[1-2] stacker lio export -n ${IMG_NAME} -W ${IMG_NAME} -d ${ENC_IMG} -w ${COMPUTE_NODES}
     clush -S -bw iscsi_srv[1-2] stacker lio config --save

.. note::
   The keyfile and image are accessed by the VM through GlusterFS. They are defined in :ref:`image and key variables definition <image_variables>`.
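For illustration only, here is how the export commands expand with a hypothetical date (``2020-01-14``) and the node range used above:

.. code-block:: shell

   # Hypothetical expansion of IMG_NAME, ENC_IMG and COMPUTE_NODES defined above
   clush -S -bw iscsi_srv[1-2] stacker lio export -n compute_img-2020-01-14 -W compute_img-2020-01-14 \
       -d /volspoms1/diskless/images/encrypted/stacker-image-2020-01-14.img -w ocean[1-1000]
   clush -S -bw iscsi_srv[1-2] stacker lio config --save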
Accessing image on the compute node
-----------------------------------

As explained in the previous sections, compute nodes boot on the DisklessTrap_ initramfs. After the boot process, the nodes need to be configured to mount the exported compute image.

* Configure the *iSCSI* client:

  .. code-block:: shell

     cat << EOF | clush -bw ${COMPUTE_NODES}
     cat > /etc/iscsi/iscsid.conf << EO_ISCSI
     iscsid.startup = /bin/systemctl start iscsid.socket iscsiuio.socket
     node.startup = automatic
     node.leading_login = No
     node.session.timeo.replacement_timeout = 15
     node.conn[0].timeo.login_timeout = 15
     node.conn[0].timeo.logout_timeout = 15
     node.session.err_timeo.abort_timeout = 15
     node.session.err_timeo.lu_reset_timeout = 30
     node.session.err_timeo.tgt_reset_timeout = 30
     node.session.initial_login_retry_max = 8
     node.session.cmds_max = 128
     node.session.queue_depth = 32
     node.session.xmit_thread_priority = -20
     node.session.iscsi.InitialR2T = No
     node.session.iscsi.ImmediateData = Yes
     node.session.iscsi.FirstBurstLength = 262144
     node.session.iscsi.MaxBurstLength = 16776192
     node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
     node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
     discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
     node.conn[0].iscsi.HeaderDigest = None
     node.session.nr_sessions = 1
     node.session.iscsi.FastAbort = Yes
     node.session.scan = auto
     discovery.sendtargets.auth.authmethod = CHAP
     discovery.sendtargets.auth.username = disco_user
     discovery.sendtargets.auth.password = disco_pass
     discovery.sendtargets.auth.username_in = disco_mutual_user
     discovery.sendtargets.auth.password_in = disco_mutual_pass
     node.session.auth.authmethod = CHAP
     node.session.auth.username_in = node_mutual_user
     node.session.auth.password_in = node_mutual_pass
     node.session.auth.username = node_user
     node.session.auth.password = node_pass
     EO_ISCSI
     EOF

* Discover the iSCSI server targets:

  .. code-block:: shell

     for server in $(nodeset -e iscsi_srv[1-2])
     do
         iscsi_prefix=$(ssh ${server} "awk -F= '/^wwn_target_prefix/ {print \$2}' /etc/stacker/stacker.conf")
         clush -bw ${COMPUTE_NODES} iscsiadm -m discovery -t st -p ${server}
         clush -bw ${COMPUTE_NODES} iscsiadm -m node -T ${iscsi_prefix}${IMG_NAME} -p ${server}:3260 -l
     done

  .. note::
     The default port for an iSCSI server is 3260.

  .. note::
     ``iscsi_srv[1-2]`` are the iSCSI servers serving ``${COMPUTE_NODES}``. This list should be adapted to the cluster architecture.

* Configure and launch multipath:

  .. code-block:: shell

     cat << EOF | clush -bw ${COMPUTE_NODES}
     cat > /etc/multipath.conf << EO_MULTIPATH
     defaults {
         polling_interval 10
         failback immediate
         no_path_retry queue
         user_friendly_names yes
         find_multipaths yes
         prio random
         uid_attribute ID_FS_UUID
     }
     blacklist {
         devnode "^zram.*"
     }
     EO_MULTIPATH
     multipathd
     EOF

* Copy the image key to the nodes:

  .. code-block:: shell

     clush -bw ${COMPUTE_NODES} --copy ${KEY} --dest /dev/shm/luksKey

.. _open_luks_device:

* Open the LUKS device:

  .. code-block:: shell

     cat < /sysroot/etc/hostname
     EOF

* Launch the boot sequence:

  .. code-block:: shell

     clush -bw ${COMPUTE_NODES} systemctl stop dracut-emergency.service
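* Optionally, verify from the compute nodes that the iSCSI sessions and the multipath topology are in place. This is only a sketch; the reported device names depend on the images exported and on the node configuration:

  .. code-block:: shell

     # List the active iSCSI sessions on the compute nodes
     clush -bw ${COMPUTE_NODES} iscsiadm -m session
     # Show the multipath topology built on top of the iSCSI LUNs
     clush -bw ${COMPUTE_NODES} multipath -ll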
Deactivate multipath
********************

In the previous section, access to the iSCSI servers is done with multipath configured. This section describes how to deactivate this feature.

* iSCSI configuration:

  .. code-block:: shell

     cat << EOF | clush -bw ${COMPUTE_NODES}
     cat > /etc/iscsi/iscsid.conf << EO_ISCSI
     iscsid.startup = /bin/systemctl start iscsid.socket iscsiuio.socket
     node.startup = automatic
     node.leading_login = No
     node.session.timeo.replacement_timeout = 600
     node.conn[0].timeo.login_timeout = 15
     node.conn[0].timeo.logout_timeout = 15
     node.conn[0].timeo.noop_out_interval = 0
     node.conn[0].timeo.noop_out_timeout = 0
     node.session.err_timeo.abort_timeout = 15
     node.session.err_timeo.lu_reset_timeout = 30
     node.session.err_timeo.tgt_reset_timeout = 30
     node.session.initial_login_retry_max = 8
     node.session.cmds_max = 128
     node.session.queue_depth = 32
     node.session.xmit_thread_priority = -20
     node.session.iscsi.InitialR2T = No
     node.session.iscsi.ImmediateData = Yes
     node.session.iscsi.FirstBurstLength = 262144
     node.session.iscsi.MaxBurstLength = 16776192
     node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
     node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
     discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
     node.conn[0].iscsi.HeaderDigest = None
     node.session.nr_sessions = 1
     node.session.iscsi.FastAbort = Yes
     node.session.scan = auto
     EO_ISCSI
     EOF

* Remove the multipath configuration:

  .. code-block:: shell

     cat << EOF | clush -bw ${COMPUTE_NODES}
     rm /etc/multipath.conf
     pkill multipathd
     EOF

* Modify :ref:`this step <open_luks_device>` when decrypting the LUKS device with:

  .. code-block:: shell

     clush -bw ${COMPUTE_NODES} cryptsetup luksOpen -d /dev/shm/luksKey /dev/sda

  .. note::
     The device used here is ``/dev/sda`` and should be adapted to the images exported by the server and to the node configuration.

DisklessTrap initramfs
----------------------

.. _DisklessTrap:

We provide a dracut module to manage diskless boot. It generates an initramfs that *traps* the node boot process. Once the node is in this state, it can be accessed through *ssh* with the tools needed to boot with any diskless method supported by Ocean.

Installation is done automatically by puppet during the VM boot. It can be done manually by installing the ``dracut-ccc-modules`` package:

.. code-block:: shell

   dnf install -y dracut-ccc-modules

Puppet configures *DisklessTrap* in order to generate a full-featured diskless initramfs image. To update the initramfs image content and behaviour, check ``/etc/dracut-ccc-modules.conf``, then regenerate the image with ``dracut -fMv``. More information is available in ``man DisklessTrap``.
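For instance, a typical update cycle on the reference VM could look like the following sketch; the inspection step with ``lsinitrd`` is only an illustration, not a required part of the procedure:

.. code-block:: shell

   # Adjust the DisklessTrap module configuration
   vi /etc/dracut-ccc-modules.conf
   # Regenerate the initramfs for the running kernel
   dracut -fMv
   # Optionally inspect the content of the generated image
   lsinitrd /boot/initramfs-$(uname -r).img | less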