Mount namespaces handbook

This handbook describes the use of mount namespaces and polyinstantiated directories in an HPC context.

There are currently two main use cases for this:

User isolation

Users should not be able to use any local storage, such as /tmp or /dev/shm, to share files with others.

All local files should be visible only to their owners, and file sharing between users should go through shared filesystems such as Lustre or NFS.

Per-user mount rights

In some cases, users of different groups (so-called containers) are allowed to access files of another group. This kind of access is hierarchical and read-only.

Given two groups A and B, B's users must get read-only access to the files of A's users, while A's users must not be able to access B's files.

Implementation

To implement these use cases, we use pam_namespace, the most widely used piece of software for this purpose.

The pam_namespace PAM module sets up a private namespace for a session with polyinstantiated directories. A polyinstantiated directory provides a different instance of itself based on user name.

The pam_namespace module disassociates the session namespace from the parent namespace. Any mounts/unmounts performed in the parent namespace, such as mounting of devices, are not reflected in the session namespace. Only original mount points are reflected.

There are two ways of doing this with pam_namespace:

  • Describing polyinstantiated directories and the base storage device to use.

  • Using a script to manually populate the mount namespace using a given base directory.

Basic usage

After adding pam_namespace to the service's PAM configuration, the namespace configuration is done in the files under /etc/security/namespace.d/.
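The module is usually enabled with a session line such as "session required pam_namespace.so" in the service file. As a quick check, the sketch below tests whether a given PAM service file contains such an active line. The function name has_pam_namespace is hypothetical, and the check is deliberately naive: it does not follow include or substack directives.

```shell
# Hypothetical helper: succeed when the PAM service file $1 contains an
# active (uncommented) session line that loads pam_namespace.so.
# Limitation: lines pulled in through "include"/"substack" are not followed.
has_pam_namespace() {
    grep -Eq '^[[:space:]]*-?session[[:space:]].*pam_namespace\.so' "$1"
}
```

For instance, "has_pam_namespace /etc/pam.d/sshd && echo enabled" would report whether the sshd service (an assumed example) loads the module directly.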

A basic usage is simply:

# POLYDIR INSTANCE_PREFIX METHOD UIDS
/tmp      /tmp/poly-inst/ user   root

This configuration tells pam_namespace, for each user other than root (4th field), to remount /tmp (1st field) on a per-user directory named after the username (3rd field) and created under the instance prefix (2nd field).

In other words, this bind-mounts a private /tmp/poly-inst/$USER onto /tmp for each non-root user.

To ensure that isolation is complete, pam_namespace requires that the instance prefix directory be inaccessible to everyone (i.e. owned by root with mode 000).

The complete format is documented in the namespace.conf(5) man page.
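The instance prefix directory has to exist with these permissions before the first session is opened. A minimal sketch of its preparation follows. The function name prepare_instance_prefix is hypothetical, and the ownership change is skipped when not running as root so that the snippet can be tried on a scratch directory.

```shell
# Hypothetical helper: create an instance prefix directory that nobody
# can enter (mode 000), owned by root when we have the rights to do so.
prepare_instance_prefix() {
    prefix_dir=$1
    mkdir -p "$prefix_dir" || return 1
    # Only root can hand the directory over to root:root.
    if [ "$(id -u)" -eq 0 ]; then
        chown root:root "$prefix_dir" || return 1
    fi
    chmod 000 "$prefix_dir"
}
```

Run as root, "prepare_instance_prefix /tmp/poly-inst" would prepare the directory used in the configuration above.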

A side effect is that system daemons cannot see the files in a user's /tmp either. Kerberos is one example: if Kerberos is configured to store the credential cache in /tmp, then when a user logs in the credential cache may be written to the wrong /tmp.

As such, using polyinstantiated directories means that system daemons cannot share files with users through those directories. For Kerberos, using KCM credential caches is one way around this. See the Kerberos handbook for details.

This basic usage implements the first use case: user isolation.

Customized usage

A more advanced usage of pam_namespace is to use custom scripts to set up the user's namespace. Ocean delivers a script that does most of the heavy lifting.

You can enable it with the following configuration:

# POLYDIR INSTANCE_PREFIX METHOD                                                         UIDS
/ccc      none            tmpfs:create=0700,root,root:mntopts=size=1M:iscript=ccc.setup  root

This configuration mounts an empty /ccc using a tmpfs. The tmpfs is set up with a few options (separated by colons, with comma-separated items inside an option) that set the ownership, permissions, and size of /ccc.

The iscript option names a script to execute on namespace initialization. Its path is relative to /etc/security/namespace.d.

pam_namespace passes four arguments to this script:

  • The polydir (i.e. /ccc)

  • The instance path (i.e. tmpfs here, or /tmp/poly-inst/$USER in the previous use case)

  • A boolean indicating whether the instance path was newly created. Always true for tmpfs.

  • The username
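To make the calling convention concrete, here is a hypothetical skeleton of such a script. It only reports the four arguments it receives; the function and variable names are illustrative and are not taken from ccc.setup. According to namespace.conf(5), the third argument is passed as 0 or 1.

```shell
#!/bin/sh
# Hypothetical iscript skeleton: it only reports what it was given.
describe_namespace_args() {
    polydir=$1        # polyinstantiated directory, e.g. /ccc
    instance_path=$2  # instance path ("tmpfs" for the tmpfs method)
    newly_created=$3  # 1 when the instance was newly created, else 0
    user=$4           # name of the user opening the session
    printf 'polydir=%s instance=%s new=%s user=%s\n' \
        "$polydir" "$instance_path" "$newly_created" "$user"
}

# pam_namespace calls the script with exactly four arguments.
if [ "$#" -eq 4 ]; then
    describe_namespace_args "$@"
fi
```

A real script would replace the printf call with the mounts and links it needs to create inside the new namespace.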

Ocean’s ccc.setup script uses another configuration file whose layout is quite similar to that of /etc/fstab. This file is /etc/fstab_user and has the following format:

# SOURCE                DESTINATION    KIND     OPTS         GROUP_FILTER
/run/mount/private/A    A              bind     defaults,rw  A
/run/mount/private/B    B              bind     defaults,rw  B
/run/mount/private/A    A              bind     defaults,ro  B
A                       C              symlink  defaults     A
B                       C              symlink  defaults     B

Each line creates a new directory or symbolic link (depending on the 3rd field) relative to the instance path. Its name is given by the 2nd field.

For bind mounts, the 1st field is the source path that gets bind-mounted. For symbolic links, the 1st field is the target of the link.

The 4th field is only used for bind mounts and indicates mount options to use when creating the bind mount. In most cases, only ro or rw is used. defaults is a keyword for no additional options.

The 5th field indicates a group filter for the mount or link of the given line. The filter can be negated by prepending a ~ to it; multiple groups are comma-separated.

This configuration gives the following result for group A’s users:

/ccc/A: Bind mount (read/write) of /run/mount/private/A

/ccc/C: Symbolic link to /ccc/A

And for B’s users:

/ccc/A: Bind mount (read-only) of /run/mount/private/A

/ccc/B: Bind mount (read/write) of /run/mount/private/B

/ccc/C: Symbolic link to /ccc/B
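The selection logic described above can be sketched as follows. This is not the ccc.setup implementation: the function names filter_matches and entries_for are hypothetical, and the sketch assumes that a leading ~ negates the whole comma-separated list and that a line without a filter selects nobody.

```shell
# Succeed when the group filter $1 selects a user whose groups are the
# space-separated list $2. Assumption: a leading "~" negates the whole
# comma-separated list.
filter_matches() {
    spec=$1
    negate=0
    case $spec in
        "~"*) negate=1; spec=${spec#"~"} ;;
    esac
    found=0
    for wanted in $(printf '%s' "$spec" | tr ',' ' '); do
        for group in $2; do
            if [ "$group" = "$wanted" ]; then found=1; fi
        done
    done
    [ "$found" -ne "$negate" ]
}

# Print "DESTINATION KIND SOURCE OPTS" for every entry of the file $1
# that applies to a user whose groups are $2. Comments and blank lines
# are skipped.
entries_for() {
    while read -r src dest kind opts filter; do
        case $src in
            ""|"#"*) continue ;;
        esac
        if filter_matches "$filter" "$2"; then
            printf '%s %s %s %s\n' "$dest" "$kind" "$src" "$opts"
        fi
    done < "$1"
}
```

With the sample file above, "entries_for /etc/fstab_user A" prints the read/write bind mount for A and the symbolic link C pointing to A, which matches the result listed for group A's users.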

Debug

To list the mount namespaces currently in use, execute the lsns command. This lists the namespaces and their associated processes.

To enter an already existing mount namespace, you can use the nsenter command:
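When only a process ID is at hand, the namespace it belongs to can also be read from /proc. A minimal sketch, with hypothetical function names; reading the link of a process owned by another user normally requires root:

```shell
# Print the mount namespace identifier of process $1,
# in the form "mnt:[<inode number>]".
mount_ns_id() {
    readlink "/proc/$1/ns/mnt"
}

# Succeed when processes $1 and $2 live in the same mount namespace.
# Returns 2 when either identifier cannot be read.
same_mount_ns() {
    first=$(mount_ns_id "$1") || return 2
    second=$(mount_ns_id "$2") || return 2
    [ "$first" = "$second" ]
}
```

For example, "same_mount_ns 1 $$" tells whether the current shell still shares the mount namespace of PID 1 or runs in a private one.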

# nsenter -t %PID -m
# [ Shell inside namespace.. ]

To debug the script’s execution, you can use the execsnoop tool from the bcc-tools package: /usr/share/bcc/tools/execsnoop.