Slurm¶
The slurm pack is a high-level pack that provides actions for managing Slurm on a cluster. It also provides a sensor for event detection based on Slurm’s strigger. Most actions and the sensor are implemented using the PySlurm library.
Note that pyslurm must be installed on the host: Slurm’s library dependencies prevent installing it inside StackStorm’s virtual environment.
Actions¶
- slurm.config.get
Get one (or all) configuration items.
If the item parameter is provided, only the matching Slurm parameter is returned.
The result is a hash of parameter names and values.
Note that parameter names are camel-cased in slurm.conf but stored as snake-cased variable names.
    # st2 run slurm.config.get item=topology_plugin
    id: 5ea03db8049f2e425784c6ba
    status: succeeded
    parameters:
      item: topology_plugin
    result:
      exit_code: 0
      result:
        topology_plugin: topology/tree
      stderr: ''
      stdout: ''
- slurm.job.find
Find jobs that match the given criteria.
The criterias parameter is required. The available criteria can be listed by running:
    python -c 'import pyslurm, json; print(json.dumps(list(pyslurm.job().get().values())[0], indent=4))'
The result is a list of job IDs.
    # st2 run slurm.job.find criterias='{"batch_host": "irene4000"}'
    id: 5ea03eb4049f2e425784c6bd
    status: succeeded
    parameters:
      criterias:
        batch_host: irene4000
    result:
      exit_code: 0
      result:
      - 223772
      stderr: ''
      stdout: ''
- slurm.job.get
Get job details for the given job IDs.
The jobids parameter is required and is a list of job IDs to query.
The result is a hash of job IDs and job details.
    # st2 run slurm.job.get jobids=223473
    id: 5ea03fc6049f2e425784c6d2
    status: succeeded
    parameters:
      jobids:
      - 223473
    result:
      exit_code: 0
      result:
        '223473':
        - accrue_time: '2020-04-22T09:56:41'
          admin_comment: null
          alloc_node: irene194
          alloc_sid: 147033
          array_job_id: null
          array_max_tasks: null
          array_task_id: null
          array_task_str: null
          [...]
      stderr: ''
      stdout: ''
- slurm.job.step.get
Get job step details for the given job IDs.
The jobids parameter is required and is a list of job IDs to query.
The result is a hash of job IDs and job step details (if any).
    # st2 run slurm.job.step.get jobids=223473
    id: 5ea04034049f2e425784c6d5
    status: succeeded
    parameters:
      jobids:
      - 223473
    result:
      exit_code: 0
      result:
        '223473': {}
      stderr: ''
      stdout: ''
- slurm.node.down
Set the given nodes as down with the given reason.
The nodes parameter is required and represents the nodes to set as down.
Optionally, you can set the reason parameter, which is the reason recorded in the node’s status. It defaults to "ST2: Downed using slurm.node.down action".
- slurm.node.drain
Set the given nodes as drained with the given reason.
The nodes parameter is required and represents the nodes to set as drained.
Optionally, you can set the reason parameter, which is the reason recorded in the node’s status. It defaults to "ST2: Drained using slurm.node.drain action".
- slurm.node.resume
Set the given nodes as resumed.
The nodes parameter is required and represents the nodes to set as resumed.
- slurm.node.get
Get node details.
The nodes parameter is required and represents the nodes to query.
The result is a hash of node names and node details.
    # st2 run slurm.node.get nodes=irene4000
    id: 5ea04179049f2e425784c6d8
    status: succeeded
    parameters:
      nodes:
      - irene4000
    result:
      exit_code: 0
      result:
        irene4000:
          alloc_cpus: 0
          alloc_mem: 0
          arch: x86_64
          boards: 1
          boot_time: 1587369782
          core_spec_cnt: 0
          cores: 16
          cores_per_socket: 16
          cpu_load: 220
          cpu_spec_list: []
          cpus: 128
          [...]
          version: '18.08'
          weight: 1
      stderr: ''
      stdout: ''
- slurm.node.reserve
Add the given nodes to a reservation.
The nodes parameter is required.
Optionally, the reservation_name parameter can be set to change the name of the reservation to use.
The props parameter can be used to modify the reservation properties. It defaults to an infinite reservation for the root user with the MAINT, IGNORE_JOBS and OVERLAP flags. This parameter is a hash with start_time, duration, users and flags keys, none of which is required. The available flags are listed in the pack’s Python library; read the code for details.
- slurm.node.unreserve
Remove the given nodes from a reservation. The reservation is deleted if empty.
The nodes parameter is required.
Optionally, the reservation_name parameter can be set to change the name of the reservation to use.
- slurm.node.update
Update the nodes’ parameters with the given information.
The nodes and infos parameters are required.
The infos parameter is a hash of items to update. The available configuration items can be found in the result of the slurm.node.get action.
- slurm.trigger.set
Set a Slurm trigger.
The program parameter is required and represents the program to launch when the trigger fires.
The jobid parameter is optional and represents the ID of the job to hook. It is used with the fini parameter to set triggers on job start or completion.
The offset parameter is optional and represents the time between the event and the launch of the trigger program. It defaults to zero.
The fini parameter is optional and defines whether the trigger is hooked on job start or job completion. It is used with the jobid parameter.
The nodes parameter is optional and defines the nodes to monitor.
The event parameter is the name of the event (for node triggers) on which the trigger fires: one of burst_buffer, drained, down, fail, up, idle or reconfig.
See the strigger man page for details.
Returns the trigger ID.
- slurm.trigger.unset
Unset a Slurm trigger.
The trigger_ids parameter is required and is an array of trigger IDs to unset.
Optionally, the user parameter may be set to change the trigger’s user. It defaults to slurm.
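The criteria matching described for slurm.job.find can be pictured with the minimal sketch below. It does not call PySlurm: the sample records are hypothetical and only mimic a dict of job details keyed by job ID, and strict equality on every criterion is an assumption, since the pack’s exact comparison rules are not described here.

```python
def find_jobs(jobs, criterias):
    """Return the sorted IDs of jobs whose fields equal every given criterion."""
    return sorted(
        job_id
        for job_id, details in jobs.items()
        if all(details.get(key) == value for key, value in criterias.items())
    )


# Hypothetical sample data, shaped like job records keyed by job ID.
sample_jobs = {
    223772: {"batch_host": "irene4000", "job_state": "RUNNING"},
    223773: {"batch_host": "irene4001", "job_state": "RUNNING"},
    223774: {"batch_host": "irene4000", "job_state": "PENDING"},
}

matches = find_jobs(sample_jobs, {"batch_host": "irene4000"})
print(matches)
```

Adding a second key to the criteria hash narrows the selection, since a job is kept only when all criteria match.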
Sensor¶
Trigger and rules¶
The slurm.trigger.sensor is a passive sensor that emits triggers on slurm node state changes. It emits 4 kinds of triggers:
- slurm.node.state
A trigger describing a single node state change. The trigger payload contains the node’s hostname and its new state.
- slurm.sensor.trigger.up
A trigger fired when the sensor is going up. Trigger payload contains a UUID different for each sensor execution.
- slurm.sensor.trigger.down
A trigger fired when the sensor is going down. Trigger payload contains a UUID different for each sensor execution.
- slurm.sensor.trigger.triggered
A trigger fired when Slurm has executed a trigger program. As Slurm triggers are ephemeral, this is used to reset the trigger. The trigger payload contains the sensor UUID and the class and type of the Slurm trigger that has been triggered.
This sensor also ships with 4 StackStorm rules for setting up the watched events:
- slurm.sensor.trigger.setup.node_down.notify
A rule that sets up the Slurm trigger for down nodes.
It is triggered by the slurm.sensor.trigger.up trigger and launches the slurm.trigger.set action with the required parameters.
- slurm.sensor.trigger.setup.node_drain.notify
A rule that sets up the Slurm trigger for drained nodes.
It is triggered by the slurm.sensor.trigger.up trigger and launches the slurm.trigger.set action with the required parameters.
- slurm.sensor.trigger.setup.node_down.reset
A rule that resets the Slurm trigger for down nodes.
It is triggered by the slurm.sensor.trigger.triggered trigger when the payload is
{"class": "node", "type": "down"}
and launches the slurm.trigger.set action with the required parameters.
- slurm.sensor.trigger.setup.node_drain.reset
A rule that resets the Slurm trigger for drained nodes.
It is triggered by the slurm.sensor.trigger.triggered trigger when the payload is
{"class": "node", "type": "drain"}
and launches the slurm.trigger.set action with the required parameters.
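To make the relationship between triggers and rules easier to follow, here is a minimal sketch of the matching logic. It is not StackStorm’s rule engine, only a lookup table written for illustration. The "drain" type used for the node_drain.reset rule is an assumption made by analogy with the /node/drain URL that the sensor listens on.

```python
# Rule name -> (trigger it listens to, payload criteria that must all match).
RULES = {
    "slurm.sensor.trigger.setup.node_down.notify": ("slurm.sensor.trigger.up", {}),
    "slurm.sensor.trigger.setup.node_drain.notify": ("slurm.sensor.trigger.up", {}),
    "slurm.sensor.trigger.setup.node_down.reset": (
        "slurm.sensor.trigger.triggered",
        {"class": "node", "type": "down"},
    ),
    # Assumption: the drain rule is keyed on type "drain".
    "slurm.sensor.trigger.setup.node_drain.reset": (
        "slurm.sensor.trigger.triggered",
        {"class": "node", "type": "drain"},
    ),
}


def matching_rules(trigger, payload):
    """Return the sorted names of the rules that would fire for this trigger."""
    return sorted(
        name
        for name, (rule_trigger, criteria) in RULES.items()
        if rule_trigger == trigger
        and all(payload.get(key) == value for key, value in criteria.items())
    )


print(matching_rules("slurm.sensor.trigger.up", {"uuid": "hypothetical-uuid"}))
```

Both notify rules match the up trigger, which is why slurm.trigger.set runs twice at sensor startup, while each triggered event re-arms only the rule for its own type.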
Description¶
The sensor works by running a small Flask server that listens on a URL generated each time the sensor starts. URLs look like http://HOST:32000/UUID/CLASS/TYPE, where UUID is a per-sensor identifier, CLASS is node (other classes may be added later to implement other kinds of triggers) and TYPE is the node’s state as described by Slurm.
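As a rough illustration of this URL layout, the sketch below builds such a URL and splits it back into its three components. The function names and the example host are hypothetical, and the pack’s own code may generate and parse its URLs differently.

```python
import uuid
from urllib.parse import urlsplit

SENSOR_PORT = 32000  # port shown in the URL pattern above


def build_url(host, sensor_uuid, trigger_class, trigger_type):
    """Assemble a URL of the form http://HOST:32000/UUID/CLASS/TYPE."""
    return "http://{}:{}/{}/{}/{}".format(
        host, SENSOR_PORT, sensor_uuid, trigger_class, trigger_type
    )


def parse_url(url):
    """Split a sensor URL back into (uuid, class, type)."""
    parts = urlsplit(url).path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError("expected a /UUID/CLASS/TYPE path, got %r" % url)
    return tuple(parts)


# A fresh identifier for each sensor run, as described above.
sensor_uuid = str(uuid.uuid4())
url = build_url("st2.example.org", sensor_uuid, "node", "down")
print(url)
print(parse_url(url))
```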
On startup, the sensor emits a slurm.sensor.trigger.up trigger, which is matched by the slurm.sensor.trigger.setup.node_down.notify and slurm.sensor.trigger.setup.node_drain.notify rules.
StackStorm then executes slurm.trigger.set twice with the following parameters:
event: "down"
program: "/usr/bin/curl -X POST http://{{ config_context.trigger_sensor_ip }}:32000/{{trigger.uuid}}/node/down -d"
event: "drain"
program: "/usr/bin/curl -X POST http://{{ config_context.trigger_sensor_ip }}:32000/{{trigger.uuid}}/node/drain -d"
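The trailing -d in both program values is easier to understand once the final command line is spelled out. The sketch below renders a program string and shows the argument list that would result when Slurm runs it. Two assumptions are made: plain str.format placeholders stand in for the Jinja expressions that StackStorm renders, and Slurm is assumed to append the affected node names as a single final argument, which curl then sends as the request body. The IP address and UUID are placeholders.

```python
import shlex

# Same shape as the "program" parameter above, with str.format placeholders
# standing in for the Jinja expressions.
PROGRAM_TEMPLATE = "/usr/bin/curl -X POST http://{ip}:32000/{uuid}/node/{state} -d"


def render_program(ip, sensor_uuid, state):
    """Fill in the sensor address, sensor UUID and watched node state."""
    return PROGRAM_TEMPLATE.format(ip=ip, uuid=sensor_uuid, state=state)


def command_run_by_slurm(program, nodelist):
    """Return the argv obtained once the node names are appended."""
    # Assumption: the node names arrive as one extra trailing argument,
    # so they become the value of curl's -d option.
    return shlex.split(program) + [nodelist]


program = render_program("192.0.2.10", "hypothetical-uuid", "down")
argv = command_run_by_slurm(program, "irene4000")
print(argv)
```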
When Slurm launches either program, the sensor is notified and does the following:
- It emits a slurm.sensor.trigger.triggered trigger that is matched by slurm.sensor.trigger.setup.node_down.reset or slurm.sensor.trigger.setup.node_drain.reset. StackStorm thus re-executes slurm.trigger.set.
- Using the input that Slurm passes through the curl request, it emits slurm.node.state for each node.
Missed events are not handled. They may occur if slurm.trigger.set actions are delayed or cancelled, so these actions should be monitored.
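The notification handling described in this section can be sketched as a pure function, independent of Flask. Several details are assumptions made for illustration: the payload key names ("uuid", "node", "state"), the request body being a plain comma-separated list of hostnames (a compressed host list such as a bracketed range is not expanded), and requests carrying another UUID being ignored.

```python
def handle_notification(sensor_uuid, path, body):
    """Return the (trigger name, payload) pairs to emit for one POST from Slurm."""
    request_uuid, trigger_class, trigger_type = path.strip("/").split("/")
    if request_uuid != sensor_uuid:
        # Assumption: a URL from a previous sensor run is silently ignored.
        return []

    # First, announce that the Slurm trigger fired so a rule can re-arm it.
    emitted = [
        (
            "slurm.sensor.trigger.triggered",
            {"uuid": sensor_uuid, "class": trigger_class, "type": trigger_type},
        )
    ]
    # Then emit one state change per node named in the request body.
    for node in (name.strip() for name in body.split(",")):
        if node:
            emitted.append(("slurm.node.state", {"node": node, "state": trigger_type}))
    return emitted


events = handle_notification("abc", "/abc/node/down", "irene4000,irene4001")
for name, payload in events:
    print(name, payload)
```

Emitting the triggered event first mirrors the order of the two steps above, so the trigger is re-armed before the per-node events are processed.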