Diskless

The diskless pack is a high-level pack for diskless compute node management. It ships with some basic actions that reuse the cluster’s procedures implemented using milkcheck.

This pack does not include those procedures; they must be implemented in /etc/milkcheck/conf/.

Actions

diskless.milkcheck.(boot|prepare|open|status)

Actions that launch the boot, prepare, open or status action of the compute_dkless service.

This milkcheck service must be implemented in the system’s milkcheck configuration files (/etc/milkcheck/conf).

The hosts parameter is required.

Optionally, these actions accept a context parameter, which is a hash of variables to set in milkcheck.
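As an illustration of the parameters above, here is a minimal Python sketch that builds an st2 run command line for one of these actions. The helper name build_st2_run, the node[1-4] host syntax and the iscsi_image value are assumptions made for this example; only the action names and the hosts and context parameters come from this pack.

```python
import json
import shlex

# The four milkcheck actions exposed by the pack.
STEPS = ("boot", "prepare", "open", "status")


def build_st2_run(step, hosts, context=None):
    """Return the argv list that runs diskless.milkcheck.<step> through the st2 CLI."""
    if step not in STEPS:
        raise ValueError(f"unknown step: {step}")
    if not hosts:
        raise ValueError("the hosts parameter is required")
    argv = ["st2", "run", f"diskless.milkcheck.{step}", f"hosts={hosts}"]
    if context:
        # Object parameters are given to the st2 CLI as a JSON string.
        argv.append("context=" + json.dumps(context, sort_keys=True))
    return argv


# Hypothetical host set and variable value, for illustration only.
argv = build_st2_run("boot", "node[1-4]", {"iscsi_image": "compute-image"})
print(shlex.join(argv))
```

The printed command is shell-quoted by shlex.join, so it can be pasted into a terminal as-is.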

diskless.remediate

This is a workflow that makes sure that the given nodes are correctly booted and ready to be used.

This action takes the following required parameters:

hosts

The hosts to check and remediate

image

The diskless image name. Used as the iscsi_image variable of milkcheck’s procedure.

vmlinuz, initrd

The names of the vmlinuz and initrd images. Used as the kexec_vmlinuz_name and kexec_initrd_name variables of milkcheck’s procedure.

Defaults to vmlinux and initrd.

concurrency

The action concurrency of this workflow.

config_ttl

Configures the maximum time to wait for the node to go through the cloudinit step of the boot process. Defaults to 30 minutes.

boot_ttl

Configures the maximum time to wait for the node to go through the POST+iPXE step of the boot process. Defaults to 10 minutes.
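The parameters above and their mapping to milkcheck variables can be summarised with a small Python sketch. The RemediateParams class and its milkcheck_context method are hypothetical names introduced here. The defaults are the ones documented above, and concurrency is left unset because no default is documented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RemediateParams:
    hosts: str                         # hosts to check and remediate
    image: str                         # diskless image name
    vmlinuz: str = "vmlinux"           # default as documented above
    initrd: str = "initrd"
    concurrency: Optional[int] = None  # no default documented
    config_ttl: int = 30               # minutes, cloud-init step
    boot_ttl: int = 10                 # minutes, POST+iPXE step

    def milkcheck_context(self):
        """Map the action parameters to the milkcheck variable names."""
        return {
            "iscsi_image": self.image,
            "kexec_vmlinuz_name": self.vmlinuz,
            "kexec_initrd_name": self.initrd,
        }


# Hypothetical values, for illustration only.
params = RemediateParams(hosts="node[1-4]", image="compute-image")
print(params.milkcheck_context())
```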

In pseudo-code, the workflow may look like the following:

@startuml

start

:milkcheck compute_dkless boot_status;

if (all ok) then (yes)
  stop
else (no)
  if (node BMC is reachable) then (no)
      :power awake unreachable BMCs;
      :wait for BMC reachable;
      :power on powered-off nodes;
      :Wait for iPXE phone_home;
      if (OK ?) then (yes)
        :Wait for boot;
        :milkcheck compute_dkless boot;
        :Wait for cloud-init phone_home;
      else (no)
        :power sleep unreachable BMCs;
        :power awake unreachable BMCs;
        :wait for BMC reachable;
        :power on powered-off nodes;
        :Wait for iPXE phone_home;
        :Wait for boot;
        :milkcheck compute_dkless boot;
        :Wait for cloud-init phone_home;
      endif
  else (yes)
    if (node is powered off) then (yes)
      :power on powered-off nodes;
      :Wait for iPXE phone_home;
      if (OK ?) then (yes)
        :Wait for boot;
        :milkcheck compute_dkless boot;
        :Wait for cloud-init phone_home;
      else (no)
        :power sleep unreachable BMCs;
        :power awake unreachable BMCs;
        :wait for BMC reachable;
        :power on powered-off nodes;
        :Wait for iPXE phone_home;
        :Wait for boot;
        :milkcheck compute_dkless boot;
        :Wait for cloud-init phone_home;
      endif
    else (no)
        if (Node is reachable ?) then (no)
          fork
           :Wait for cloud-init phone_home;
          fork again
           :Wait for iPXE phone_home;
           if (OK ?) then (yes)
             :Wait for boot;
             :milkcheck compute_dkless boot;
             :Wait for cloud-init phone_home;
           else (no)
             :power sleep unreachable BMCs;
             :power awake unreachable BMCs;
             :wait for BMC reachable;
             :power on powered-off nodes;
             :Wait for iPXE phone_home;
             :Wait for boot;
             :milkcheck compute_dkless boot;
             :Wait for cloud-init phone_home;
           endif
          end fork
        else (yes)
          if (Node is in diskless trap ?) then (yes)
            :milkcheck compute_dkless boot;
            :Wait for cloudinit phone_home call;
          else (no)
            if (Node is in cloud-init ?) then (yes)
              :Wait for cloudinit phone_home call;
            else (no)
              :fail;
            endif
          endif
        endif
      endif
    endif
endif

stop

@enduml
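The decision tree of the diagram can also be read as a plain function. The following Python sketch is not the pack's implementation: NodeState, plan and the ipxe_ok flag are names invented for this example, and the step labels are copied from the diagram. It returns the ordered steps for one node, with the fork represented as a ("fork", branch, branch) tuple.

```python
from dataclasses import dataclass

AWAKE = ["power awake unreachable BMCs", "wait for BMC reachable"]
POWER_ON = ["power on powered-off nodes", "Wait for iPXE phone_home"]
BOOT = ["Wait for boot", "milkcheck compute_dkless boot",
        "Wait for cloud-init phone_home"]


@dataclass
class NodeState:
    boot_status_ok: bool = False
    bmc_reachable: bool = True
    powered_off: bool = False
    reachable: bool = True
    in_diskless_trap: bool = False
    in_cloudinit: bool = False


def after_ipxe_wait(ipxe_ok):
    """Steps following a 'Wait for iPXE phone_home', depending on its outcome."""
    if ipxe_ok:
        return list(BOOT)
    # No iPXE phone_home: power-cycle through the BMC and retry once.
    return ["power sleep unreachable BMCs"] + AWAKE + POWER_ON + BOOT


def plan(node, ipxe_ok=True):
    """Return the remediation steps for one node, following the diagram."""
    if node.boot_status_ok:
        return []
    if not node.bmc_reachable:
        return AWAKE + POWER_ON + after_ipxe_wait(ipxe_ok)
    if node.powered_off:
        return POWER_ON + after_ipxe_wait(ipxe_ok)
    if not node.reachable:
        # Both waits run in parallel in the diagram.
        return [("fork",
                 ["Wait for cloud-init phone_home"],
                 ["Wait for iPXE phone_home"] + after_ipxe_wait(ipxe_ok))]
    if node.in_diskless_trap:
        return ["milkcheck compute_dkless boot", "Wait for cloud-init phone_home"]
    if node.in_cloudinit:
        return ["Wait for cloud-init phone_home"]
    return ["fail"]


for step in plan(NodeState(powered_off=True), ipxe_ok=False):
    print(step)
```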

diskless.wait_for.(ipxe|cloudinit)

Actions that wait for a phone_home call.

These actions use inquiries and generated rules within a workflow to wait for a diskless.cloudinit.phone_home or diskless.ipxe.phone_home trigger for the given node. They require the diskless.wait_(ipxe|cloudinit).arming rule to be present and enabled.

The host parameter is required.

The ttl parameter defines how long (in minutes) this workflow should wait. This requires inquiries garbage collection to be enabled (purge_inquiries in the garbagecollector section of st2.conf). Defaults to 10 minutes.

Because of StackStorm internal timers, ttl values below 10 minutes may not time out immediately.

The sequence of these actions is a bit tricky. Here is a quick sequence diagram for diskless.wait_for.ipxe (the cloudinit one is nearly identical).

Name     Description
Action   diskless.wait_for.ipxe action
Rule A   diskless.wait_ipxe.arming rule
Rule B   diskless.wait_ipxe.arming.NODE rule
Node     A booting node

Sensor

The diskless.phone_home.sensor is a simple sensor that listens for events coming from the boot process. Nodes report to it through the phone_home cloud-init module, which posts some data to the sensor, and through a simple imgfetch iPXE command. The phone_home cloud-init and iPXE configuration itself is not handled here; this sensor only listens for phone_home events.

This sensor is a Flask server listening on 32001/tcp. It emits a diskless.cloudinit.phone_home trigger carrying the URL-encoded data posted by cloud-init, or a diskless.ipxe.phone_home trigger when called from iPXE.

Triggers

diskless.cloudinit.phone_home

A trigger that indicates that a node has almost finished its cloud-init process.

The payload contains details about the node:

  • pub_key_dsa, pub_key_rsa and pub_key_ecdsa: SSH host keys present on the node.

  • instance_id: cloud-init’s instance-id, may be derived from the hostname

  • hostname, fqdn: Node hostname and fqdn
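To make the payload concrete, here is a minimal Python sketch that decodes a URL-encoded phone_home body into the fields listed above. It is not the sensor's code: decode_phone_home is a hypothetical helper, and the sample values are invented for illustration.

```python
from urllib.parse import parse_qs, urlencode

# Payload fields listed above.
FIELDS = ("pub_key_dsa", "pub_key_rsa", "pub_key_ecdsa",
          "instance_id", "hostname", "fqdn")


def decode_phone_home(body):
    """Decode a URL-encoded phone_home body into a flat payload dict."""
    parsed = parse_qs(body, keep_blank_values=True)
    # parse_qs returns a list of values per key; keep the first one.
    return {name: parsed[name][0] for name in FIELDS if name in parsed}


# Invented sample body, built the way an HTML form post would encode it.
sample = urlencode({
    "hostname": "node1",
    "fqdn": "node1.example.com",
    "instance_id": "i-node1",
    "pub_key_rsa": "ssh-rsa AAAAB3Nza root@node1",
})
payload = decode_phone_home(sample)
print(payload["hostname"], payload["fqdn"])
```

Fields that are absent from the body (here pub_key_dsa and pub_key_ecdsa) are simply left out of the resulting dict.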

diskless.ipxe.phone_home

A trigger that indicates that a node is currently booting and is in the iPXE step.

The payload only contains the node hostname (hostname).