=========================
Mount namespaces handbook
=========================

This handbook describes the use of mount namespaces and polyinstantiated directories in an HPC context.

There are currently two main use cases:

User isolation
  Users should not be able to use any local storage, like ``/tmp`` or ``/dev/shm``, to share files with others.

  All local files should be visible only to their owners, and sharing files between users should only go through shared filesystems like **Lustre** or **NFS**.

Per-user mount rights
  In some cases, users of one group (so-called *containers*) are allowed to access files of another group. This kind of access is hierarchical and read-only.

  Given two groups **A** and **B**, we have to provide a way to give **B**'s users read-only access to the files of **A**'s users, while **A**'s users must not be able to access **B**'s files.

Implementation
--------------

To implement these use cases, we use ``pam_namespace``, the most commonly used software for this purpose.

The ``pam_namespace`` PAM module sets up a private namespace for a session with polyinstantiated directories. A polyinstantiated directory provides a different instance of itself based on user name.

The ``pam_namespace`` module disassociates the session namespace from the parent namespace. Any mounts/unmounts performed in the parent namespace, such as mounting of devices, are not reflected in the session namespace. Only original mount points are reflected.

There are two ways of doing this with ``pam_namespace``:

* Describing polyinstantiated directories and the base storage device to use.
* Using a script to manually populate the mount namespace using a given base directory.

Basic usage
^^^^^^^^^^^

After adding ``pam_namespace`` to the service's PAM configuration, the namespace configuration is done in files under ``/etc/security/namespace.d/``.
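
As a quick check that a service actually loads the module, the sketch below looks for a ``session`` line referencing ``pam_namespace.so``. The helper name ``uses_pam_namespace`` and the service file ``/etc/pam.d/sshd`` are assumptions chosen for illustration; the control flag and the file to inspect depend on the local PAM setup.

.. code-block:: shell

  # uses_pam_namespace FILE
  # Succeeds when FILE contains an uncommented "session" line that
  # references pam_namespace.so (hypothetical helper, not part of PAM).
  uses_pam_namespace() {
      grep -Eq '^[[:space:]]*session[[:space:]]+.*pam_namespace\.so' "$1"
  }

  # Example call (the service file name is an assumption):
  #   uses_pam_namespace /etc/pam.d/sshd && echo "pam_namespace enabled"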

A basic usage is simply:

.. code-block:: shell

  # POLYDIR INSTANCE_PREFIX   METHOD UIDS
  /tmp      /tmp/poly-inst/   user   root

This configuration states that for each user who is **not** ``root`` (4th field), ``/tmp`` (1st field) is remounted using the ``user`` method (3rd field), which creates one instance directory per username under the instance prefix (2nd field).

In other words, this bind-mounts a private ``/tmp/poly-inst/$USER`` onto ``/tmp`` for each non-root user.

To ensure that isolation is complete, ``pam_namespace`` **requires** that the instance prefix directory is not accessible to anyone (i.e. owned by root with mode ``000``).
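
The requirement above could be met with something like the following sketch. The function name ``prepare_instance_prefix`` is hypothetical, and the ownership change is skipped when not running as root so that the helper can be tried on a scratch directory.

.. code-block:: shell

  # prepare_instance_prefix DIR
  # Create DIR, owned by root where possible, with mode 000.
  prepare_instance_prefix() {
      dir=$1
      mkdir -p "$dir"
      # Only root can change ownership; skip this step otherwise.
      if [ "$(id -u)" -eq 0 ]; then
          chown root:root "$dir"
      fi
      chmod 000 "$dir"
  }

  # Example call (as root):
  #   prepare_instance_prefix /tmp/poly-inst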

The complete format is documented in the ``namespace.conf(5)`` man page.

A side effect is that even system daemons cannot use files in a user's ``/tmp``.
Kerberos is one example: if it is configured to store its credential cache in ``/tmp``, the cache may be written to the *wrong* ``/tmp`` when a user logs in.

Consequently, with polyinstantiated directories, system daemons cannot share files with users through those directories. For Kerberos, using ``KCM`` credential caches is one way around this. See :ref:`Kerberos handbook` for details about Kerberos.

This basic usage implements the first use case: **User isolation**.


Customized usage
^^^^^^^^^^^^^^^^

A more advanced usage of ``pam_namespace`` is to use custom scripts to set up the user's namespace. Ocean delivers a script that does most of the heavy lifting.

You can enable the usage of this with the following configuration:

.. code-block:: none

  # POLYDIR INSTANCE_PREFIX METHOD                                                         UIDS
  /ccc      none            tmpfs:create=0700,root,root:mntopts=size=1M:iscript=ccc.setup  root

This configuration mounts an empty ``/ccc`` using a ``tmpfs``. The ``tmpfs`` is set up with flags (separated by colons, with comma-separated values inside a flag) that change the permissions and size of ``/ccc``.

The ``iscript`` flag names a script to be executed on namespace initialization. The script path is relative to ``/etc/security/namespace.d``.

``pam_namespace`` passes four arguments to the script:

* The ``polydir`` (i.e. ``/ccc``)
* The instance path (i.e. ``tmpfs`` here, ``/tmp/poly-inst/$USER`` in the previous use case)
* A flag indicating whether the instance path was newly created (``0`` for no, ``1`` for yes). Always true for ``tmpfs``.
* The username
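
A minimal skeleton showing how an init script might read the four arguments listed above. This is only a sketch and not Ocean's ``ccc.setup``; the function name ``namespace_init`` and the message it prints are made up for illustration.

.. code-block:: shell

  #!/bin/sh
  # Hypothetical init script skeleton (not the real ccc.setup).
  namespace_init() {
      polydir=$1        # polyinstantiated directory, e.g. /ccc
      instance=$2       # instance path
      newly_created=$3  # 1 if the instance was just created, 0 otherwise
      user=$4           # name of the user opening the session

      # Only populate an instance that was just created.
      if [ "$newly_created" != "1" ]; then
          return 0
      fi
      echo "initializing $polydir for $user (instance: $instance)"
  }

  namespace_init "$@"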

Ocean's ``ccc.setup`` script uses another configuration file that is quite similar to ``/etc/fstab``. This file is ``/etc/fstab_user`` and uses the following format:

.. code-block:: none

  # SOURCE                DESTINATION    KIND     OPTS         GROUP_FILTER
  /run/mount/private/A    A              bind     defaults,rw  A
  /run/mount/private/B    B              bind     defaults,rw  B
  /run/mount/private/A    A              bind     defaults,ro  B
  A                       C              symlink  defaults     A
  B                       C              symlink  defaults     B

Each line creates a new directory or symbolic link (depending on the 3rd field) relative to the instance path. Its name is given by the 2nd field.

For bind mounts, the 1st field is the source path to bind-mount. For symbolic links, the 1st field is the target of the link.

The 4th field is only used for bind mounts and gives the mount options to use when creating the bind mount. In most cases, only ``ro`` or ``rw`` is used. ``defaults`` is a keyword meaning no additional options.

The 5th field is a group filter for the mount or link on that line. The filter can be negated by prepending a ``~``, and multiple groups are comma-separated.

This configuration gives the following result for group **A**'s users:

``/ccc/A``: Bind mount (read/write) of ``/run/mount/private/A``

``/ccc/C``: Symbolic link to ``/ccc/A``

And for **B**'s users:

``/ccc/A``: Bind mount (read-only) of ``/run/mount/private/A``

``/ccc/B``: Bind mount (read/write) of ``/run/mount/private/B``

``/ccc/C``: Symbolic link to ``/ccc/B``
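
To make the field semantics concrete, here is a dry-run sketch that reads lines in the ``/etc/fstab_user`` format and prints what would be set up for a user with given groups. It is not Ocean's ``ccc.setup``: the function names are hypothetical, nothing is mounted, and treating a negated filter as "member of none of the listed groups" is an assumption.

.. code-block:: shell

  # group_filter_matches FILTER USER_GROUPS
  # Both arguments are comma-separated lists; a leading "~" negates FILTER.
  # Assumption: a negated filter matches users in none of the listed groups.
  group_filter_matches() {
      gf_filter=$1
      gf_groups=$2
      gf_negate=0
      case $gf_filter in
          "~"*) gf_negate=1; gf_filter=${gf_filter#?} ;;
      esac
      gf_found=0
      gf_old_ifs=$IFS
      IFS=,
      for gf_wanted in $gf_filter; do
          for gf_have in $gf_groups; do
              if [ "$gf_wanted" = "$gf_have" ]; then gf_found=1; fi
          done
      done
      IFS=$gf_old_ifs
      # Match when found and not negated, or not found and negated.
      [ "$gf_found" -ne "$gf_negate" ]
  }

  # plan_mounts POLYDIR USER_GROUPS < fstab_user
  # Print one line per entry that applies to the user; performs no mounts.
  plan_mounts() {
      pm_polydir=$1
      pm_groups=$2
      while read -r src dest kind opts filter; do
          # Skip blank lines and comments.
          case $src in ''|'#'*) continue ;; esac
          group_filter_matches "$filter" "$pm_groups" || continue
          case $kind in
              bind)
                  printf '%s: bind mount of %s (%s)\n' \
                      "$pm_polydir/$dest" "$src" "$opts" ;;
              symlink)
                  printf '%s: symlink to %s\n' "$pm_polydir/$dest" "$src" ;;
          esac
      done
  }

Feeding the example file to ``plan_mounts /ccc A`` prints one line for ``/ccc/A`` and one for ``/ccc/C``, in file order.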


Debug
^^^^^

To list the mount namespaces currently in use, run the ``lsns`` command. It lists the namespaces and their associated processes.
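
The output can be restricted to mount namespaces with ``lsns -t mnt``. To check which mount namespace a given process is in, the ``/proc/<pid>/ns/mnt`` link can also be read directly, as in the sketch below. The helper names are hypothetical, and reading the link for another user's process generally requires root.

.. code-block:: shell

  # mnt_ns_of PID
  # Print the mount namespace identifier of a process, e.g. mnt:[4026531840].
  mnt_ns_of() {
      readlink "/proc/$1/ns/mnt"
  }

  # same_mnt_ns PID1 PID2
  # Succeed when both processes share the same mount namespace.
  same_mnt_ns() {
      [ "$(mnt_ns_of "$1")" = "$(mnt_ns_of "$2")" ]
  }

  # Example: compare the current shell with PID 1 (as root):
  #   same_mnt_ns $$ 1 || echo "this shell is in a private mount namespace"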

To enter an existing mount namespace, use the ``nsenter`` command:

.. code-block:: console

  # nsenter -t %PID -m
  # [ Shell inside namespace.. ]

To debug the script's execution, you can use the ``execsnoop`` tool from the ``bcc-tools`` package: ``/usr/share/bcc/tools/execsnoop``.