GlusterFS operations
====================

GlusterFS Architecture
----------------------

Ocean's `GlusterFS` deployments are composed of N groups of 3 servers. Each group can be seen as an availability cell with a redundancy of 1, which means that at most one member of each group can be lost.

Each server has 4 data bricks on a single RAID 10 block device, but `GlusterFS` bricks have to be separate block devices. The RAID array is therefore split using LVM thin pools as follows:

.. code-block:: none

   +-------------------+    +-----------------+
   |RAID10 Array (N Gb)|-+--|Brick 1 ( N/2 Gb)|
   +-------------------+ |  +-----------------+
                         |  +-----------------+
                         +--|Brick 2 ( N/2 Gb)|
                         |  +-----------------+
                         |  +-----------------+
                         +--|Brick 3 ( N/2 Gb)|
                         |  +-----------------+
                         |  +-----------------+
                         +--|Brick 4 ( N/2 Gb)|
                            +-----------------+

As you can see, the brick volumes are overcommitted: the total provisioned capacity is roughly twice that of the raw device. This means that LVM thin pool usage has to be monitored, which can be done with the ``lvs`` command:

.. code-block:: console

   # lvs gluster
     LV        VG      Attr       LSize  Pool      Origin Data%  Meta%  Move Log Cpy%Sync Convert
     brick1    gluster Vwi-aot---  1.25t thin_pool        27.01
     brick2    gluster Vwi-aot---  1.25t thin_pool        27.74
     brick3    gluster Vwi-aot---  1.25t thin_pool         0.06
     brick4    gluster Vwi-aot---  2.50t thin_pool         0.40
     thin_pool gluster twi-aot--- <2.59t                  26.87  0.65

The preceding output shows that the overall pool is about 27% full (``thin_pool`` LV) and how much of its virtual size each individual brick consumes (``brick*`` LVs).

Ocean's `GlusterFS` deployment provides 2 independent volumes with different setups:

``volspoms1/voldata``

* Distributed, replicated once (2 data copies), 1 arbiter brick per replica set, 64MB shards
* 3 bricks per server: 2 data bricks and 1 arbiter brick (a brick that does not consume space but guarantees consistency)
* The volume is laid out across the cluster availability cells in a circular fashion:

  .. code-block:: none

     +------------------CELL-----------------+  +------------------CELL-----------------+
     | +----------+ +----------+ +----------+|  | +----------+ +----------+ +----------+|
     | | Server 1 | | Server 2 | | Server 3 ||  | | Server 4 | | Server 5 | | Server 6 ||
     | +----------+ +----------+ +----------+|  | +----------+ +----------+ +----------+|
     | +----------+ +----------+ +----------+|  | +----------+ +----------+ +----------+|
     | | Brick 1  | | Brick 1  | | Arbiter  ||  | | Brick 1  | | Brick 1  | | Arbiter  ||
     | +----------+ +----------+ +----------+|  | +----------+ +----------+ +----------+|
     | +----------+ +----------+ +----------+|  | +----------+ +----------+ +----------+|
     | | Brick 2  | | Arbiter  | | Brick 2  ||  | | Brick 2  | | Arbiter  | | Brick 2  ||
     | +----------+ +----------+ +----------+|  | +----------+ +----------+ +----------+|
     | +----------+ +----------+ +----------+|  | +----------+ +----------+ +----------+|
     | | Arbiter  | | Brick 3  | | Brick 3  ||  | | Arbiter  | | Brick 3  | | Brick 3  ||
     | +----------+ +----------+ +----------+|  | +----------+ +----------+ +----------+|
     +---------------------------------------+  +---------------------------------------+

* Options of the `GlusterFS` ``virt`` option group are applied
* Pros:

  * This volume is well-suited for files bigger than 100 MB
  * The arbiter bricks guarantee consistency without the space usage of a third copy of the data

* Cons:

  * The cost of reconstructing this volume after an incident grows with the number of files
  * Only one redundant copy of the data (the arbiter holds no file contents)

* Usage: VM images, RPM repos, Pcocc repo, diskless images, ...
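For reference, a volume with this shape could be created along the following lines. This is an illustrative sketch only, not necessarily the exact command used to build this deployment: the hostnames, arbiter brick paths and volume name below are placeholders, while the ``replica 3 arbiter 1`` layout, the ``virt`` option group and the 64MB shard size follow the description above.

.. code-block:: shell

   # Illustrative only: replica sets of 2 data bricks + 1 arbiter brick,
   # distributed across the availability cells shown above.
   # server1..server3 and the arbiter paths are placeholders.
   gluster volume create volspoms1 replica 3 arbiter 1 \
       server1:/gluster/brick1/data server2:/gluster/brick1/data server3:/gluster/arbiter1/data \
       server1:/gluster/brick2/data server3:/gluster/brick2/data server2:/gluster/arbiter2/data \
       ...

   # Apply the virt option group and 64MB shards, as described above.
   gluster volume set volspoms1 group virt
   gluster volume set volspoms1 features.shard-block-size 64MB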
``volspoms2/volconf``

* Distributed, replicated twice (3 data copies)
* 1 brick per server
* Replica sets map naturally onto the cluster availability cells:

  .. code-block:: none

     +------------------CELL-----------------+  +------------------CELL-----------------+
     | +----------+ +----------+ +----------+|  | +----------+ +----------+ +----------+|
     | | Server 1 | | Server 2 | | Server 3 ||  | | Server 4 | | Server 5 | | Server 6 ||
     | +----------+ +----------+ +----------+|  | +----------+ +----------+ +----------+|
     | +----------+ +----------+ +----------+|  | +----------+ +----------+ +----------+|
     | | Brick 4  | | Brick 4  | | Brick 4  ||  | | Brick 4  | | Brick 4  | | Brick 4  ||
     | +----------+ +----------+ +----------+|  | +----------+ +----------+ +----------+|
     +---------------------------------------+  +---------------------------------------+

* Pros:

  * This volume is well-suited for small files
  * 3 copies of the data
  * Simple set-up

* Cons:

  * Consumes a lot of space

* Usage: everything else that needs to be shared

Operations
----------

Status
^^^^^^

To monitor a volume, use the ``gluster volume status`` command::

   # gluster volume status volspoms2
   Status of volume: volspoms2
   Gluster process                             TCP Port  RDMA Port  Online  Pid
   ------------------------------------------------------------------------------
   Brick top1-data.mg1.hpc.domain.fr:/gluste
   r/brick4/data                               49152     0          Y       69912
   Brick top2-data.mg1.hpc.domain.fr:/gluste
   r/brick4/data                               49155     0          Y       40746
   Brick top3-data.mg1.hpc.domain.fr:/gluste
   r/brick4/data                               49155     0          Y       187494
   Brick worker1-data.mg1.hpc.domain.fr:/glu
   ster/brick4/data                            49155     0          Y       87455
   Brick worker2-data.mg1.hpc.domain.fr:/glu
   ster/brick4/data                            49155     0          Y       144119
   Brick worker3-data.mg1.hpc.domain.fr:/glu
   ster/brick4/data                            49155     0          Y       85702
   Self-heal Daemon on localhost               N/A       N/A        Y       64412
   Self-heal Daemon on worker3-data.mg1.hpc.
   domain.fr                                   N/A       N/A        Y       63672
   Self-heal Daemon on top2-data.mg1.hpc.
   domain.fr                                   N/A       N/A        Y       224322
   Self-heal Daemon on top3-data.mg1.hpc.
   domain.fr                                   N/A       N/A        Y       60817
   Self-heal Daemon on worker2-data.mg1.hpc.
   domain.fr                                   N/A       N/A        Y       176211
   Self-heal Daemon on worker1-data.mg1.hpc.
   domain.fr                                   N/A       N/A        Y       4036

   Task Status of Volume volspoms2
   ------------------------------------------------------------------------------
   There are no active volume tasks

This shows all `GlusterFS` daemons and their status (``Online`` column). These daemons run under the `glusterd` `systemd` service:

.. code-block:: shell

   # systemctl status glusterd
   glusterd.service - GlusterFS, a clustered file-system server
      Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
      Active: active (running) since Wed 2019-08-28 16:25:20 CEST; 1 months 6 days ago
    Main PID: 69843 (glusterd)
      CGroup: /system.slice/glusterd.service
              ├─46356 /usr/sbin/glusterfsd -s top1-data.mg1.hpc.domain.fr --volfile-id volspoms1.top1-data.mg1.domain...
              ├─46364 /usr/sbin/glusterfsd -s top1-data.mg1.hpc.domain.fr --volfile-id volspoms1.top1-data.mg1.domain...
              ├─46371 /usr/sbin/glusterfsd -s top1-data.mg1.hpc.domain.fr --volfile-id volspoms1.top1-data.mg1.domain...
              ├─64412 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/..
              ├─69843 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
              └─69912 /usr/sbin/glusterfsd -s top1-data.mg1.hpc.domain.fr --volfile-id volspoms2.top1-data.mg1.domain...

``glusterfsd`` processes are brick daemons while ``glusterfs`` processes are self-heal daemons.
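To combine these checks, a small per-server script along the following lines can verify that ``glusterd`` is running and that no brick of the two volumes is reported offline. This is a minimal sketch: it only parses the standard ``gluster volume status ... detail`` output, and the reporting is left to local conventions.

.. code-block:: shell

   #!/bin/bash
   # Minimal per-server health check sketch.
   systemctl is-active --quiet glusterd || echo "glusterd is not running on $(hostname -s)"

   for vol in volspoms1 volspoms2; do
       # "gluster volume status <VOL> detail" prints one "Online : Y/N" line per brick.
       offline=$(gluster volume status "$vol" detail 2>/dev/null | grep -c '^Online.*: N')
       [ "$offline" -gt 0 ] && echo "$vol: $offline brick(s) reported offline"
   done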
Self-heal daemons can be enabled or disabled using ``gluster`` commands:

.. code-block:: shell

   # gluster volume heal volspoms2 disable
   Disable heal on volume volspoms2 has been successful

   # gluster volume heal volspoms2 enable
   Enable heal on volume volspoms2 has been successful

Workload
^^^^^^^^

`GlusterFS` workload can be monitored with the ``volume top`` and ``volume profile`` commands. You can also take a state dump of the brick processes.

Profile
"""""""

Profiling must be started before any information can be retrieved:

.. code-block:: shell

   # gluster volume profile volspoms2 info
   Profile on Volume volspoms2 is not started

   # gluster volume profile volspoms2 start
   Starting volume profile on volspoms2 has been successful

   # gluster volume profile volspoms2 info
   [...]

   # gluster volume profile volspoms2 stop
   Stopping volume profile on volspoms2 has been successful

Once profiling is started, the ``profile info`` command reports the following information *per brick*:

* Cumulative statistics

  * Read/write counts per block size
  * Latency statistics and number of calls per file operation
  * An overall bytes read/written counter for the given interval

* Interval statistics

  * Read/write counts per block size
  * Latency statistics and number of calls per file operation
  * An overall bytes read/written counter for the given interval

Top
"""

Using the ``volume top`` command you can list the 10 most opened files, on all bricks or on a specific brick:

.. code-block:: shell

   # gluster volume top volspoms2 open list-cnt 10
   Brick: top1-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Current open fds: 54, Max open fds: 176, Max openfd time: 2019-10-03 08:41:23.213966
   Count     filename
   =======================
   17497     /logs/mg1.hpc.domain.fr/islet55/messages
   14205     /logs/mg1.hpc.domain.fr/islet54/messages
   9909      /logs/mg1.hpc.domain.fr/islet12/messages
   9406      /logs/mg1.hpc.domain.fr/worker3/messages
   7486      /logs/mg1.hpc.domain.fr/top1/messages
   6112      /PMSM_namespaces/islet54/log/pmsm.debug
   [...]

.. code-block:: shell

   # gluster volume top volspoms2 open list-cnt 10 brick worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Brick: worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Current open fds: 44, Max open fds: 111, Max openfd time: 2019-10-03 08:41:23.203716
   Count     filename
   =======================
   15417     /logs/mg1.hpc.domain.fr/worker1/messages
   13698     /logs/mg1.hpc.domain.fr/worker2/messages
   12663     /logs/mg1.hpc.domain.fr/infra2/messages
   8559      /logs/mg1.hpc.domain.fr/top2/messages
   8489      /logs/mg1.hpc.domain.fr/top3/messages
   7933      /logs/mg1.hpc.domain.fr/infra1/messages
   6487      /tftp_data/poap.py
   4844      /logs/mg1.hpc.domain.fr/worker3/cron
   4821      /logs/mg1.hpc.domain.fr/ns3/cron
   4770      /logs/mg1.hpc.domain.fr/ns2/cron

You can also view the files with the highest number of write calls (overall or per brick):

.. code-block:: shell

   # gluster volume top volspoms1 write list-cnt 20 brick top1-data.mg1.hpc.domain.fr:/gluster/brick1/data
   Brick: top1-data.mg1.hpc.domain.fr:/gluster/brick1/data
   Count     filename
   =======================
   15530
   12139     /.shard/fee4016d-8773-47d1-a901-9148db080328.66
   11407     /9a10eb4c-3ab9-42fa-a210-ff01ff95b599.1
   9829
   8334      /.shard/995bde66-947a-433d-b436-270b5259386e.42
   7392      /pcocc/persistent_drives/i54conf1.qcow2
   [...]

Similar commands are available for ``read``, ``opendir`` and ``readdir`` operations. Counters can be reset with the ``gluster volume top volspoms2 clear`` command.
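Putting the two commands together, a short workload snapshot can be captured like this. It is only a sketch reusing the commands shown above: the 60-second window, list length and output path are arbitrary choices.

.. code-block:: shell

   #!/bin/bash
   # Sketch: capture a 60-second profile window plus the current top open/read lists.
   out="/tmp/volspoms2-workload-$(date +%Y%m%d-%H%M%S).txt"

   gluster volume profile volspoms2 start
   sleep 60
   {
       gluster volume profile volspoms2 info
       gluster volume top volspoms2 open list-cnt 10
       gluster volume top volspoms2 read list-cnt 10
   } > "$out"
   gluster volume profile volspoms2 stop

   echo "Workload snapshot written to $out"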
Statedump
"""""""""

`GlusterFS` state dumps contain memory usage, I/O buffer details, translator information, pending calls, open file descriptors and inode tables. To take a state dump of a given volume, use ``gluster volume statedump VOLUME``:

.. code-block:: shell

   # gluster volume statedump volspoms1
   volume statedump: success

   # gluster volume get volspoms1 server.statedump-path
   Option                                  Value
   ------                                  -----
   server.statedump-path                   /var/run/gluster

   # ls /var/run/gluster/gluster-brick*
   /var/run/gluster/gluster-brick1-data.46356.dump.1570201474
   /var/run/gluster/gluster-brick2-data.46364.dump.1570201475
   /var/run/gluster/gluster-brick3-data.46371.dump.1570201476

State dumps can be filtered by appending an argument indicating the information to be dumped (``all|mem|iobuf|callpool|priv|fd|inode|history``):

.. code-block:: shell

   # gluster volume statedump volspoms1 callpool
   volume statedump: success

Volume management
^^^^^^^^^^^^^^^^^

Start/Stop
""""""""""

Volumes are started and stopped with the corresponding ``gluster`` commands:

.. code-block:: shell

   # gluster volume stop volspoms2
   Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
   Stopping volume volspoms2 has been successful

Self healing
""""""""""""

A pro-active self-healing daemon diagnoses issues and initiates self-healing every 10 minutes on the files that require it. A file needs healing when there are unsynced entries between its replicated copies. ``volume heal`` commands are not very costly and can be launched at any time.

You can view the list of files that need healing, the list of files which are currently being or have recently been healed, and the list of files in split-brain state, and you can manually trigger self-heal on the entire volume or only on the files which need healing.

Trigger a self-heal on files requiring healing:

.. code-block:: shell

   # gluster volume heal volspoms1
   Heal operation on volume volspoms1 has been successful

Trigger a self-heal on all the files of a volume:

.. code-block:: shell

   # gluster volume heal volspoms1 full
   Heal operation on volume volspoms1 has been successful

View the list of files that need healing:

.. code-block:: shell

   # gluster volume heal volspoms2 info
   Brick top1-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Status: Connected
   Number of entries: 0

   Brick top2-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Status: Connected
   Number of entries: 0

   Brick top3-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Status: Connected
   Number of entries: 0

   Brick worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Status: Connected
   Number of entries: 0

   Brick worker2-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Status: Connected
   Number of entries: 0

   Brick worker3-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Status: Connected
   Number of entries: 0

View the list of files of a particular volume which are in split-brain state:

.. code-block:: shell

   # gluster volume heal volspoms2 info split-brain
   Brick top1-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Status: Connected
   Number of entries in split-brain: 0

   Brick top2-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Status: Connected
   Number of entries in split-brain: 0

   Brick top3-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Status: Connected
   Number of entries in split-brain: 0

   Brick worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Status: Connected
   Number of entries in split-brain: 0

   Brick worker2-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Status: Connected
   Number of entries in split-brain: 0

   Brick worker3-data.mg1.hpc.domain.fr:/gluster/brick4/data
   Status: Connected
   Number of entries in split-brain: 0
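Since the ``Number of entries`` lines are what matter for day-to-day monitoring, a short loop can flag volumes with a non-empty heal backlog. This is a sketch that only parses the ``heal ... info`` output shown above; how the result is reported is up to you.

.. code-block:: shell

   #!/bin/bash
   # Sketch: report volumes that still have entries waiting to be healed.
   for vol in volspoms1 volspoms2; do
       pending=$(gluster volume heal "$vol" info | awk '/^Number of entries:/ {sum += $4} END {print sum + 0}')
       if [ "$pending" -gt 0 ]; then
           echo "$vol: $pending entries pending heal"
       fi
   done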
Logs
""""

Servers and FUSE clients store their logs in ``/var/log/glusterfs`` by default.

FUSE clients
    Create one log file per mount point. The name of the log file corresponds to the `GlusterFS` volume name, and its location can be changed with the ``log-file`` mount option.

Servers
    Create a separate log file for each component:

    * CLI: ``/var/log/glusterfs/cli.log``
    * *glusterd*: ``/var/log/glusterfs/glusterd.log``
    * Healing daemons: ``/var/log/glusterfs/glfsheal-VOLUME.log``, ``/var/log/glusterfs/glustershd.log``
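For example, a FUSE mount can redirect its client log to a custom location with the ``log-file`` (and optionally ``log-level``) mount options. The mount point and log path below are arbitrary examples, and any server of the pool can be used as the volfile server.

.. code-block:: shell

   # Example only: custom client log location and log level for a FUSE mount.
   mount -t glusterfs \
         -o log-file=/var/log/glusterfs/volspoms2-client.log,log-level=WARNING \
         top1-data.mg1.hpc.domain.fr:/volspoms2 /mnt/volspoms2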