Slurm¶
The slurm pack is a high-level pack that provides actions for managing Slurm on a cluster. It also provides a sensor for event detection based on Slurm’s strigger. Most actions and the sensor are implemented using the PySlurm library.
Note that pyslurm must be installed on the host itself, because Slurm’s library dependencies prevent installing it inside StackStorm’s virtual environment.
Actions¶
- slurm.config.get
Get one (or all) configuration items.
If the item parameter is provided, only the matching Slurm parameter is returned.
Result is a hash of parameter names and values.
Note that parameter names are camel-cased in slurm.conf but stored as snake-cased variable names (for example, TopologyPlugin becomes topology_plugin).
# st2 run slurm.config.get item=topology_plugin
id: 5ea03db8049f2e425784c6ba
status: succeeded
parameters:
  item: topology_plugin
result:
  exit_code: 0
  result:
    topology_plugin: topology/tree
  stderr: ''
  stdout: ''
- slurm.job.find
Find jobs that match the given criteria.
The criterias parameter is required. The available criteria can be listed by running python -c 'import pyslurm, json; print(json.dumps(list(pyslurm.job().get().values())[0], indent=4))'.
Result is a list of job IDs.
# st2 run slurm.job.find criterias='{"batch_host": "irene4000"}'
id: 5ea03eb4049f2e425784c6bd
status: succeeded
parameters:
  criterias:
    batch_host: irene4000
result:
  exit_code: 0
  result:
  - 223772
  stderr: ''
  stdout: ''
- slurm.job.get
Get job details for the given job IDs.
The jobids parameter is required and is a list of job IDs to query.
Result is a hash of job ID and job details.
# st2 run slurm.job.get jobids=223473
id: 5ea03fc6049f2e425784c6d2
status: succeeded
parameters:
  jobids:
  - 223473
result:
  exit_code: 0
  result:
    '223473':
    - accrue_time: '2020-04-22T09:56:41'
      admin_comment: null
      alloc_node: irene194
      alloc_sid: 147033
      array_job_id: null
      array_max_tasks: null
      array_task_id: null
      array_task_str: null
      [...]
  stderr: ''
  stdout: ''
- slurm.job.step.get
Get job step details for the given job IDs.
The jobids parameter is required and is a list of job IDs to query.
Result is a hash of job ID and job step details (if any).
# st2 run slurm.job.step.get jobids=223473
id: 5ea04034049f2e425784c6d5
status: succeeded
parameters:
  jobids:
  - 223473
result:
  exit_code: 0
  result:
    '223473': {}
  stderr: ''
  stdout: ''
- slurm.node.down
Set the given nodes as down with the given reason.
The nodes parameter is required and represents the nodes to set as down.
Optionally, you can modify the reason parameter, which is the reason recorded in the node’s status. This defaults to "ST2: Downed using slurm.node.down action".
- slurm.node.drain
Set the given nodes as drained with the given reason.
The nodes parameter is required and represents the nodes to set as drained.
Optionally, you can modify the reason parameter, which is the reason recorded in the node’s status. This defaults to "ST2: Drained using slurm.node.drain action".
- slurm.node.resume
Set the given nodes as resumed.
The nodes parameter is required and represents the nodes to set as resumed.
- slurm.node.get
Get node details.
The nodes parameter is required and represents the nodes to query.
Result is a hash of node name and node details.
# st2 run slurm.node.get nodes=irene4000
id: 5ea04179049f2e425784c6d8
status: succeeded
parameters:
  nodes:
  - irene4000
result:
  exit_code: 0
  result:
    irene4000:
      alloc_cpus: 0
      alloc_mem: 0
      arch: x86_64
      boards: 1
      boot_time: 1587369782
      core_spec_cnt: 0
      cores: 16
      cores_per_socket: 16
      cpu_load: 220
      cpu_spec_list: []
      cpus: 128
      [...]
      version: '18.08'
      weight: 1
  stderr: ''
  stdout: ''
- slurm.node.reserve
Add the given nodes to a reservation.
The nodes parameter is required.
Optionally, the reservation_name parameter can be set to change the name of the reservation to use.
The props parameter can be used to modify the reservation properties. This defaults to an infinite reservation for the root user with the MAINT, IGNORE_JOBS and OVERLAP flags. This parameter is a hash with start_time, duration, users and flags keys; none of them is required. The available flags are listed in the pack’s Python library; read the code for details.
- slurm.node.unreserve
Remove the given nodes from a reservation. The reservation is deleted if empty.
The nodes parameter is required.
Optionally, the reservation_name parameter can be set to change the name of the reservation to use.
- slurm.node.update
Update node parameters with the given information.
The nodes and infos parameters are required.
The infos parameter is a hash of items to update. The available configuration items can be found in the result of the slurm.node.get action.
- slurm.trigger.set
Set a Slurm trigger.
The program parameter is required and represents the program to launch when the trigger fires.
The jobid parameter is optional and represents the job ID to hook. It is used with the fini parameter to set triggers on job start or completion.
The offset parameter is optional and represents the time between the event and the launch of the trigger program. Zero by default.
The fini parameter is optional and defines whether the trigger is hooked on job start or job completion. It is used with the jobid parameter.
The nodes parameter is optional and defines the nodes to monitor.
The event parameter is the name of the event (for node triggers) on which this trigger should be launched: burst_buffer, drained, down, fail, up, idle or reconfig.
See the strigger man page for details.
Returns the trigger ID.
- slurm.trigger.unset
Unset a Slurm trigger.
The trigger_ids parameter is required and represents an array of trigger IDs to unset.
Optionally, the user parameter may be set to change the trigger user. Defaults to slurm.
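To make the criterias parameter of slurm.job.find more concrete, the following minimal sketch filters a job table by exact field equality. It does not use PySlurm: SAMPLE_JOBS is hypothetical data shaped like a job ID to job details mapping, find_jobs is an illustrative name, and exact-equality matching on every key is an assumption, since the pack's actual matching rules are not described here.

```python
def find_jobs(jobs, criterias):
    """Return the sorted IDs of jobs whose fields equal every criteria value.

    Assumption: a job matches only if all given keys are present and equal.
    """
    return sorted(
        job_id
        for job_id, details in jobs.items()
        if all(details.get(key) == value for key, value in criterias.items())
    )


# Hypothetical sample data: job ID -> job details (only two fields shown).
SAMPLE_JOBS = {
    223772: {"batch_host": "irene4000", "alloc_node": "irene194"},
    223473: {"batch_host": "irene4001", "alloc_node": "irene194"},
}

if __name__ == "__main__":
    print(find_jobs(SAMPLE_JOBS, {"batch_host": "irene4000"}))
```

With this sample table, an empty criterias hash matches every job, and a criteria naming a field that a job lacks never matches that job.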
Sensor¶
Trigger and rules¶
The slurm.trigger.sensor is a passive sensor that emits triggers on slurm node state changes. It emits 4 kinds of triggers:
- slurm.node.state
A trigger describing a single node state change. The trigger payload contains the node hostname and the node’s new state.
- slurm.sensor.trigger.up
A trigger fired when the sensor is going up. Trigger payload contains a UUID different for each sensor execution.
- slurm.sensor.trigger.down
A trigger fired when the sensor is going down. Trigger payload contains a UUID different for each sensor execution.
- slurm.sensor.trigger.triggered
A trigger fired when Slurm has triggered an execution. As Slurm triggers are ephemeral, this is used to reset the trigger. The trigger payload contains the sensor UUID and the class and type of the Slurm trigger that has been triggered.
This sensor also ships with 4 StackStorm rules for setting up watched events:
- slurm.sensor.trigger.setup.node_down.notify
A rule that sets up the down nodes slurm trigger.
It is triggered by the slurm.sensor.trigger.up trigger and launches the slurm.trigger.set action with the required parameters.
- slurm.sensor.trigger.setup.node_drain.notify
A rule that sets up the drained nodes slurm trigger.
It is triggered by the slurm.sensor.trigger.up trigger and launches the slurm.trigger.set action with the required parameters.
- slurm.sensor.trigger.setup.node_down.reset
A rule that resets the down nodes slurm trigger.
It is triggered by the slurm.sensor.trigger.triggered trigger when the payload is {"class": "node", "type": "down"}.
It launches the slurm.trigger.set action with the required parameters.
- slurm.sensor.trigger.setup.node_drain.reset
A rule that resets the drained nodes slurm trigger.
It is triggered by the slurm.sensor.trigger.triggered trigger when the payload is {"class": "node", "type": "drain"}.
It launches the slurm.trigger.set action with the required parameters.
Description¶
This sensor works by running a small Flask server that listens on a URL generated each time the sensor is executed. The URLs look like http://HOST:32000/UUID/CLASS/TYPE, where UUID is a per-sensor identifier, CLASS is node (other classes may be added later to implement other kinds of triggers) and TYPE is the node’s state as described by Slurm.
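As an illustration of this URL scheme, the sketch below generates a per-run identifier and splits an incoming URL into its CLASS and TYPE parts using only the standard library. It is not the sensor's Flask code: the function names are hypothetical, and rejecting requests that carry a UUID from an earlier run is an assumption about how such a sensor might behave.

```python
import uuid
from urllib.parse import urlsplit

SENSOR_PORT = 32000  # port used in the URL scheme described above


def new_sensor_uuid():
    """Generate a fresh identifier for one sensor run."""
    return str(uuid.uuid4())


def parse_notification_url(url, sensor_uuid):
    """Split .../UUID/CLASS/TYPE into (class, type).

    Returns None when the path does not have exactly three segments or
    when the UUID does not belong to the current sensor run (assumed policy).
    """
    parts = urlsplit(url).path.strip("/").split("/")
    if len(parts) != 3:
        return None
    run_id, trigger_class, trigger_type = parts
    if run_id != sensor_uuid:
        return None
    return trigger_class, trigger_type


if __name__ == "__main__":
    current = new_sensor_uuid()
    url = "http://localhost:%d/%s/node/down" % (SENSOR_PORT, current)
    print(parse_notification_url(url, current))
```

Because the UUID changes on every run, a Slurm trigger registered by a previous run would point at a path the new run does not recognise.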
On startup, the sensor emits a slurm.sensor.trigger.up trigger, which is matched by the slurm.sensor.trigger.setup.node_down.notify and slurm.sensor.trigger.setup.node_drain.notify rules.
StackStorm then executes slurm.trigger.set twice with the following parameters:
event: "down"
program: "/usr/bin/curl -X POST http://{{ config_context.trigger_sensor_ip }}:32000/{{trigger.uuid}}/node/down -d"
event: "drain"
program: "/usr/bin/curl -X POST http://{{ config_context.trigger_sensor_ip }}:32000/{{trigger.uuid}}/node/drain -d"
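The two parameter sets above differ only in the event name. As a rough illustration, the sketch below builds them with Python's str.format standing in for StackStorm's Jinja rendering. The function name, the sample IP address and the sample UUID are hypothetical; the command line and port are copied from the parameters above.

```python
# Same command as in the rule parameters, with str.format placeholders
# standing in for the Jinja expressions.
PROGRAM_TEMPLATE = "/usr/bin/curl -X POST http://{ip}:32000/{uuid}/node/{event} -d"


def trigger_set_parameters(sensor_ip, sensor_uuid, events=("down", "drain")):
    """Return one slurm.trigger.set parameter dict per watched event."""
    return [
        {
            "event": event,
            "program": PROGRAM_TEMPLATE.format(
                ip=sensor_ip, uuid=sensor_uuid, event=event
            ),
        }
        for event in events
    ]


if __name__ == "__main__":
    # Hypothetical sensor address and UUID, for illustration only.
    sample_uuid = "1b4e28ba-2fa1-11d2-883f-0016d3cca427"
    for params in trigger_set_parameters("192.0.2.10", sample_uuid):
        print(params)
```

The command ends with -d, so presumably whatever argument Slurm appends when it launches the program becomes the body of the POST request.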
When Slurm launches either program, the sensor is notified and does the following:
- It emits a slurm.sensor.trigger.triggered trigger that is matched by slurm.sensor.trigger.setup.node_down.reset or slurm.sensor.trigger.setup.node_drain.reset. Thus, StackStorm re-executes slurm.trigger.set.
- Using Slurm’s input passed through curl’s request, it emits a slurm.node.state trigger for each node.
Missed events are not handled; they can occur if slurm.trigger.set actions are delayed or cancelled. This should be monitored.
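The notification handling described above can be sketched as a pure function, independent of Flask and StackStorm. Everything here beyond the trigger names is an assumption: the payload keys node and state are hypothetical, the uuid, class and type keys are inferred from the rule criteria and templates shown earlier, and the request body is assumed to be a plain comma-separated list of node names. Slurm may send a compressed host list instead (such as irene[4000-4002]), which would need expanding first and is not attempted here.

```python
def handle_notification(sensor_uuid, trigger_class, trigger_type, body):
    """Return the (trigger name, payload) pairs to emit for one notification.

    The first pair lets the reset rule re-register the Slurm trigger; the
    following pairs report one state change per node named in the body.
    """
    triggers = [
        (
            "slurm.sensor.trigger.triggered",
            {"uuid": sensor_uuid, "class": trigger_class, "type": trigger_type},
        )
    ]
    if trigger_class == "node":
        # Assumption: body is a simple comma-separated list of node names.
        for node in body.split(","):
            node = node.strip()
            if node:
                triggers.append(
                    ("slurm.node.state", {"node": node, "state": trigger_type})
                )
    return triggers


if __name__ == "__main__":
    for name, payload in handle_notification(
        "sensor-run-id", "node", "down", "irene4000,irene4001"
    ):
        print(name, payload)
```

Emitting the triggered event first mirrors the order of the steps listed above, so the Slurm trigger can be re-armed even if the node list turns out to be empty.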