Slurm
^^^^^

The ``slurm`` pack is a high-level pack that provides actions for managing Slurm on a cluster.
It also provides a sensor for event detection based on Slurm's ``strigger``.

Most actions and the sensor are implemented using the ``PySlurm`` library. Note that ``pyslurm``
must be installed on the host, because Slurm's library dependencies prevent installing it inside
StackStorm's virtual environment.

Actions
"""""""

**slurm.config.get**

Get one configuration item, or all of them. If the ``item`` parameter is provided, only the
matching Slurm parameter is returned.

Result is a hash of parameter names and values. Note that parameter names are camel-cased in
``slurm.conf`` but stored as snake-cased variable names.

.. code-block:: yaml

    # st2 run slurm.config.get item=topology_plugin
    id: 5ea03db8049f2e425784c6ba
    status: succeeded
    parameters:
      item: topology_plugin
    result:
      exit_code: 0
      result:
        topology_plugin: topology/tree
      stderr: ''
      stdout: ''

**slurm.job.find**

Find jobs that match the given criteria. The ``criterias`` parameter is required; the available
criteria can be listed with
``python -c 'import pyslurm, json; print(json.dumps(list(pyslurm.job().get().values())[0], indent=4))'``.

Result is a list of job IDs.

.. code-block:: yaml

    # st2 run slurm.job.find criterias='{"batch_host": "irene4000"}'
    id: 5ea03eb4049f2e425784c6bd
    status: succeeded
    parameters:
      criterias:
        batch_host: irene4000
    result:
      exit_code: 0
      result:
      - 223772
      stderr: ''
      stdout: ''

**slurm.job.get**

Get job details for the given job IDs. The ``jobids`` parameter is required and is a list of
job IDs to query.

Result is a hash of job IDs and job details.

.. code-block:: yaml

    # st2 run slurm.job.get jobids=223473
    id: 5ea03fc6049f2e425784c6d2
    status: succeeded
    parameters:
      jobids:
      - 223473
    result:
      exit_code: 0
      result:
        '223473':
        - accrue_time: '2020-04-22T09:56:41'
          admin_comment: null
          alloc_node: irene194
          alloc_sid: 147033
          array_job_id: null
          array_max_tasks: null
          array_task_id: null
          array_task_str: null
          [...]
      stderr: ''
      stdout: ''

**slurm.job.step.get**

Get job step details for the given job IDs. The ``jobids`` parameter is required and is a list of
job IDs to query.

Result is a hash of job IDs and job step details (if any).

.. code-block:: yaml

    # st2 run slurm.job.step.get jobids=223473
    id: 5ea04034049f2e425784c6d5
    status: succeeded
    parameters:
      jobids:
      - 223473
    result:
      exit_code: 0
      result:
        '223473': {}
      stderr: ''
      stdout: ''

**slurm.node.down**

Set the given nodes as *downed* with the given reason. The ``nodes`` parameter is required and
represents the nodes to set as *downed*.

Optionally, the ``reason`` parameter can be modified; it is the reason recorded in the node's
status and defaults to ``ST2: Downed using slurm.node.down action``.

**slurm.node.drain**

Set the given nodes as *drained* with the given reason. The ``nodes`` parameter is required and
represents the nodes to set as *drained*.

Optionally, the ``reason`` parameter can be modified; it is the reason recorded in the node's
status and defaults to ``ST2: Drained using slurm.node.drain action``.

**slurm.node.resume**

Set the given nodes as *resumed*. The ``nodes`` parameter is required and represents the nodes to
set as *resumed*.
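For illustration, draining a node for maintenance and resuming it afterwards could look like the
following hypothetical invocations (the node name and reason are placeholders, and the execution
output is omitted):

.. code-block:: yaml

    # Illustrative commands only; both actions accept the parameters described above.
    # st2 run slurm.node.drain nodes=irene4000 reason="ST2: scheduled maintenance"
    # st2 run slurm.node.resume nodes=irene4000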
**slurm.node.get**

Get node details. The ``nodes`` parameter is required and represents the nodes to query.

Result is a hash of node names and node details.

.. code-block:: yaml

    # st2 run slurm.node.get nodes=irene4000
    id: 5ea04179049f2e425784c6d8
    status: succeeded
    parameters:
      nodes:
      - irene4000
    result:
      exit_code: 0
      result:
        irene4000:
          alloc_cpus: 0
          alloc_mem: 0
          arch: x86_64
          boards: 1
          boot_time: 1587369782
          core_spec_cnt: 0
          cores: 16
          cores_per_socket: 16
          cpu_load: 220
          cpu_spec_list: []
          cpus: 128
          [...]
          version: '18.08'
          weight: 1
      stderr: ''
      stdout: ''

**slurm.node.reserve**

Add the given nodes to a reservation. The ``nodes`` parameter is required.

Optionally, ``reservation_name`` can be modified to change the reservation name to use. The
``props`` parameter can be used to modify the reservation properties; it defaults to an infinite
reservation for the root user with the MAINT, IGNORE_JOBS and OVERLAP flags. This parameter is a
hash with ``start_time``, ``duration``, ``users`` and ``flags`` keys, none of which is required.
The available flags are listed in the pack's Python library; read the code for details.

**slurm.node.unreserve**

Remove the given nodes from a reservation. The reservation is deleted if it becomes empty. The
``nodes`` parameter is required.

Optionally, ``reservation_name`` can be modified to change the reservation name to use.

**slurm.node.update**

Update the node parameters with the given information. The ``nodes`` and ``infos`` parameters are
required. The ``infos`` parameter is a hash of items to update. The available configuration items
can be found in the ``slurm.node.get`` action result.

**slurm.trigger.set**

Set a Slurm trigger.

The ``program`` parameter is required and represents the program to launch when the trigger fires.

The ``jobid`` parameter is optional and represents the job ID to hook. It is used together with the
``fini`` parameter to set triggers on job start or completion.

The ``offset`` parameter is optional and represents the delay between the event occurring and the
trigger program being launched. Zero by default.

The ``fini`` parameter is optional and defines whether the trigger should be hooked on job start or
job completion. Used together with the ``jobid`` parameter.

The ``nodes`` parameter is optional and defines the nodes to monitor.

The ``event`` parameter is the event name (for node triggers) on which this trigger should fire:
either *burst_buffer*, *drained*, *down*, *fail*, *up*, *idle* or *reconfig*. See the ``strigger``
man page for details.

Returns the trigger ID.

**slurm.trigger.unset**

Unset a Slurm trigger. The ``trigger_ids`` parameter is required and represents an array of
trigger IDs to unset.

Optionally, ``user`` may be set to change the trigger user. Defaults to *slurm*.

Sensor
""""""

Trigger and rules
'''''''''''''''''

The **slurm.trigger.sensor** is a passive sensor that emits triggers on Slurm node state changes.

It emits 4 kinds of triggers:

**slurm.node.state**

A trigger describing a single node state change. The trigger payload contains the node hostname and
the node's new state.

**slurm.sensor.trigger.up**

A trigger fired when the sensor is going up. The trigger payload contains a UUID that is different
for each sensor execution.

**slurm.sensor.trigger.down**

A trigger fired when the sensor is going down. The trigger payload contains a UUID that is
different for each sensor execution.

**slurm.sensor.trigger.triggered**

A trigger fired when Slurm triggered an execution. As Slurm triggers are ephemeral, this is used to
reset the trigger. The trigger payload contains the sensor UUID and the class and type of the Slurm
trigger that has been triggered.
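A consumer of the sensor typically writes a rule against **slurm.node.state**. A hypothetical rule
reacting to downed nodes could look like the following sketch; it assumes payload keys named
``node`` and ``state`` (check the pack's trigger definition for the exact names), and the pack name
and ``core.local`` action are placeholders:

.. code-block:: yaml

    ---
    name: "notify_on_node_down"
    pack: "examples"                  # hypothetical pack holding this rule
    description: "Illustrative rule reacting to slurm.node.state; not shipped with the pack."
    enabled: true

    trigger:
      type: "slurm.node.state"

    criteria:
      trigger.state:                  # assumed payload key
        type: "equals"
        pattern: "down"

    action:
      ref: "core.local"               # placeholder action
      parameters:
        cmd: "echo 'Node {{ trigger.node }} is down'"   # assumed payload key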
This sensor also ships with 4 StackStorm rules for setting up the watched events:

**slurm.sensor.trigger.setup.node_down.notify**

A rule that sets up the downed-nodes Slurm trigger. It is triggered by the
**slurm.sensor.trigger.up** trigger and launches the **slurm.trigger.set** action with the required
parameters.

**slurm.sensor.trigger.setup.node_drain.notify**

A rule that sets up the drained-nodes Slurm trigger. It is triggered by the
**slurm.sensor.trigger.up** trigger and launches the **slurm.trigger.set** action with the required
parameters.

**slurm.sensor.trigger.setup.node_down.reset**

A rule that resets the downed-nodes Slurm trigger. It is triggered by the
**slurm.sensor.trigger.triggered** trigger when the payload is
``{"class": "node", "type": "down"}``, and launches the **slurm.trigger.set** action with the
required parameters.

**slurm.sensor.trigger.setup.node_drain.reset**

A rule that resets the drained-nodes Slurm trigger. It is triggered by the
**slurm.sensor.trigger.triggered** trigger when the payload is
``{"class": "node", "type": "drain"}``, and launches the **slurm.trigger.set** action with the
required parameters.

Description
'''''''''''

This sensor works by opening a small `Flask` server listening on a URL that is generated each time
the sensor starts. The listened URLs look like ``http://HOST:32000/UUID/CLASS/TYPE``, where *UUID*
is a per-sensor identifier, *CLASS* is ``node`` (it may be used to implement other kinds of
triggers later) and *TYPE* is the node's state as described by Slurm.

On startup, the sensor emits a **slurm.sensor.trigger.up** trigger that is matched by the
**slurm.sensor.trigger.setup.node_down.notify** and **slurm.sensor.trigger.setup.node_drain.notify**
rules. StackStorm then executes **slurm.trigger.set** twice, with the following parameters:

.. code-block:: yaml

    event: "down"
    program: "/usr/bin/curl -X POST http://{{ config_context.trigger_sensor_ip }}:32000/{{trigger.uuid}}/node/down -d"

.. code-block:: yaml

    event: "drain"
    program: "/usr/bin/curl -X POST http://{{ config_context.trigger_sensor_ip }}:32000/{{trigger.uuid}}/node/drain -d"

When Slurm launches either program, the sensor is notified and does the following:

* It emits a **slurm.sensor.trigger.triggered** trigger that is matched by
  **slurm.sensor.trigger.setup.node_down.reset** or **slurm.sensor.trigger.setup.node_drain.reset**.
  StackStorm thus re-executes **slurm.trigger.set**.
* Using the Slurm input passed through curl's request, it emits a **slurm.node.state** trigger for
  each node.

Missed events are not handled; they may happen if the **slurm.trigger.set** actions are delayed and
cancelled. This should be monitored.
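The ``{{ config_context.trigger_sensor_ip }}`` expression used above comes from the pack
configuration. A minimal configuration could therefore look like the following sketch; the path
and IP address are illustrative, and any other configuration keys the pack may accept are not
shown:

.. code-block:: yaml

    # /opt/stackstorm/configs/slurm.yaml (illustrative path and value)
    ---
    trigger_sensor_ip: "192.0.2.10"   # address the sensor's Flask server is reachable on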