GlusterFS operations

GlusterFS Architecture

Ocean’s GlusterFS deployments are composed of N groups of 3 servers. Each group can be seen as an availability cell with a redundancy of 1, meaning that at most one member of each group can be lost.

Each server has 4 data bricks on a single RAID 10 block device. GlusterFS bricks, however, have to be separate block devices, so the RAID array is split using LVM thin pools as follows:

+-------------------+    +-----------------+
|RAID10 Array (N Gb)|-+--|Brick 1 ( N/2 Gb)|
+-------------------+ |  +-----------------+
                      |  +-----------------+
                      +--|Brick 2 ( N/2 Gb)|
                      |  +-----------------+
                      |  +-----------------+
                      +--|Brick 3 ( N/2 Gb)|
                      |  +-----------------+
                      |  +-----------------+
                      +--|Brick 4 ( N/2 Gb)|
                         +-----------------+
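
A minimal sketch of how such a layout can be built, assuming the volume group is named gluster and reusing the brick sizes visible in the lvs output below (the actual provisioning is handled by the deployment tooling):

# lvcreate --type thin-pool -l 100%FREE -n thin_pool gluster
# lvcreate --thin -V 1.25T -n brick1 gluster/thin_pool
# lvcreate --thin -V 1.25T -n brick2 gluster/thin_pool
# lvcreate --thin -V 1.25T -n brick3 gluster/thin_pool
# lvcreate --thin -V 2.50T -n brick4 gluster/thin_pool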

As the diagram shows, the brick volumes are overcommitted: their total size is twice the raw device capacity. This means that the filling of the LVM thin pool has to be monitored.

This can be done using the lvs command:

# lvs gluster
LV        VG      Attr       LSize  Pool      Origin Data%  Meta%  Move Log Cpy%Sync Convert
brick1    gluster Vwi-aot---  1.25t thin_pool        27.01
brick2    gluster Vwi-aot---  1.25t thin_pool        27.74
brick3    gluster Vwi-aot---  1.25t thin_pool        0.06
brick4    gluster Vwi-aot---  2.50t thin_pool        0.40
thin_pool gluster twi-aot--- <2.59t                  26.87  0.65

The preceding output shows that the overall pool is about 27% filled (thin_pool LV); the Data% column of each brickN LV gives the per-brick usage (here bricks 1 and 2 are around 27% filled while bricks 3 and 4 are nearly empty).
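
Since the pool is overcommitted, its Data% is the value to watch. A minimal sketch of an alert one-liner, assuming the gluster/thin_pool names above and an arbitrary 80% threshold:

# lvs --noheadings -o data_percent gluster/thin_pool | awk '$1+0 > 80 {print "thin_pool is "$1"% full"}'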

Ocean’s GlusterFS deployment provides 2 independent volumes with different setups:

``volspoms1/voldata``
  • Distributed, replicated once, 1 arbiter brick, 64MB shards

  • 3 bricks per server: 2 data bricks, 1 arbiter brick (a brick that doesn’t consume space but guarantees consistency)

  • Volumes are set up across the cluster availability cells in a circular fashion:

      +------------------CELL-----------------+ +-----------------CELL-----------------+
      | +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
      | | Server 1 | | Server 2 | | Server 3 || || Server 4 | | Server 5 | | Server 6 ||
      | +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
      | +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
      | |  Brick 1 | |  Brick 1 | |  Arbiter || ||  Brick 1 | |  Brick 1 | |  Arbiter ||
      | +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
      | +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
      | |  Brick 2 | |  Arbiter | |  Brick 2 || ||  Brick 2 | |  Arbiter | |  Brick 2 ||
      | +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
      | +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
      | |  Arbiter | |  Brick 3 | |  Brick 3 || ||  Arbiter | |  Brick 3 | |  Brick 3 ||
      | +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
      +---------------------------------------+ +--------------------------------------+


  * Options of the GlusterFS ``virt`` option group are applied (see the sketch after this list)

  * Pros:

    * This volume is well-suited for files bigger than 100 MB

    * The arbiter brick guarantees consistency without the space usage of a third copy of the data

  * Cons:

    * The cost of reconstruction of this volume after an incident will grow with the number of files

    * Only one copy of the data

  * Usage: VM images, RPM repos, Pcocc repo, diskless images, ...
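
  As an illustration only, this is roughly how the shard and virt settings above are applied to a volume (a sketch; the real deployment tooling sets these, and option names should be checked against the installed GlusterFS version):

  # gluster volume set volspoms1 group virt
  # gluster volume set volspoms1 features.shard on
  # gluster volume set volspoms1 features.shard-block-size 64MB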

``volspoms2/volconf``
  * Distributed, replicated 2 times

  * 1 brick per server

  * Volumes map onto the cluster availability cells naturally (see the sketch after this list):

    +------------------CELL-----------------+ +-----------------CELL-----------------+
    | +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
    | | Server 1 | | Server 2 | | Server 3 || || Server 4 | | Server 5 | | Server 6 ||
    | +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
    | +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
    | |  Brick 4 | |  Brick 4 | |  Brick 4 || ||  Brick 4 | |  Brick 4 | |  Brick 4 ||
    | +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
    +---------------------------------------+ +--------------------------------------+

  * Pros:

    * This volume is well-suited for small files

    * 3 copies of the data

    * Simple set-up

  * Cons:

    * Consumes lots of space

  * Usage: Everything else that should be shared
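
  For reference, a sketch of the kind of create command that yields this layout; brick paths are those from the status output below, and consecutive groups of 3 bricks form the replica sets, hence the natural mapping onto cells (not necessarily the exact command that was used):

  # gluster volume create volspoms2 replica 3 \
      top1-data.mg1.hpc.domain.fr:/gluster/brick4/data \
      top2-data.mg1.hpc.domain.fr:/gluster/brick4/data \
      top3-data.mg1.hpc.domain.fr:/gluster/brick4/data \
      worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data \
      worker2-data.mg1.hpc.domain.fr:/gluster/brick4/data \
      worker3-data.mg1.hpc.domain.fr:/gluster/brick4/data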

Operations

Status

To monitor a volume you can use the gluster volume status command:

# gluster volume status volspoms2
Status of volume: volspoms2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick top1-data.mg1.hpc.domain.fr:/gluste
r/brick4/data                               49152     0          Y       69912
Brick top2-data.mg1.hpc.domain.fr:/gluste
r/brick4/data                               49155     0          Y       40746
Brick top3-data.mg1.hpc.domain.fr:/gluste
r/brick4/data                               49155     0          Y       187494
Brick worker1-data.mg1.hpc.domain.fr:/glu
ster/brick4/data                            49155     0          Y       87455
Brick worker2-data.mg1.hpc.domain.fr:/glu
ster/brick4/data                            49155     0          Y       144119
Brick worker3-data.mg1.hpc.domain.fr:/glu
ster/brick4/data                            49155     0          Y       85702
Self-heal Daemon on localhost               N/A       N/A        Y       64412
Self-heal Daemon on worker3-data.mg1.hpc.
domain.fr                                    N/A       N/A        Y       63672
Self-heal Daemon on top2-data.mg1.hpc.
domain.fr                                    N/A       N/A        Y       224322
Self-heal Daemon on top3-data.mg1.hpc.
domain.fr                                    N/A       N/A        Y       60817
Self-heal Daemon on worker2-data.mg1.hpc.
domain.fr                                   N/A       N/A        Y       176211
Self-heal Daemon on worker1-data.mg1.hpc.
domain.fr                                   N/A       N/A        Y       4036

Task Status of Volume volspoms2
------------------------------------------------------------------------------
There are no active volume tasks

This shows all GlusterFS daemons and their status (Online column). These daemons run under the glusterd systemd service:

# systemctl status glusterd
  glusterd.service - GlusterFS, a clustered file-system server
   Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-08-28 16:25:20 CEST; 1 months 6 days ago
   Main PID: 69843 (glusterd)
   CGroup: /system.slice/glusterd.service
         ├─46356 /usr/sbin/glusterfsd -s top1-data.mg1.hpc.domain.fr --volfile-id volspoms1.top1-data.mg1.domain...
         ├─46364 /usr/sbin/glusterfsd -s top1-data.mg1.hpc.domain.fr --volfile-id volspoms1.top1-data.mg1.domain...
         ├─46371 /usr/sbin/glusterfsd -s top1-data.mg1.hpc.domain.fr --volfile-id volspoms1.top1-data.mg1.domain...
         ├─64412 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/..
         ├─69843 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
         └─69912 /usr/sbin/glusterfsd -s top1-data.mg1.hpc.domain.fr --volfile-id volspoms2.top1-data.mg1.domain...

glusterfsd processes are brick daemons while glusterfs processes are self-heal daemons.

Self-heal daemons can be enabled or disabled using gluster commands:

# gluster volume heal volspoms2 disable
Disable heal on volume volspoms2 has been successful
# gluster volume heal volspoms2 enable
Enable heal on volume volspoms2 has been successful

Workload

Monitoring the GlusterFS workload can be done with the top and profile commands. You can also take a state dump of the brick processes.

Profile

Profiling must be started before any information can be retrieved:

# gluster volume profile volspoms2 info
Profile on Volume volspoms2 is not started
# gluster volume profile volspoms2 start
Starting volume profile on volspoms2 has been successful
# gluster volume profile volspoms2 info
[...]
# gluster volume profile volspoms2 stop
Stopping volume profile on volspoms2 has been successful

The profile info command then reports the following information per brick:

  • Cumulative statistics

    • Per block size read/write counts

    • Per file operation latency statistics and number of calls

    • An overall bytes read/written counter on the given interval

  • Interval statistics

    • Per block size read/write counts

    • Per file operation latency statistics and number of calls

    • An overall bytes read/written counter on the given interval
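
For example, a minimal sketch for capturing a one-off profiling sample over a fixed window (the 60-second window and output path are arbitrary choices):

# gluster volume profile volspoms2 start
# sleep 60
# gluster volume profile volspoms2 info > /tmp/volspoms2-profile.txt
# gluster volume profile volspoms2 stop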

Top

Using the volume top command you can:

List the 10 most opened files on all bricks or on a specific brick:
# gluster volume top volspoms2 open list-cnt 10
Brick: top1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Current open fds: 54, Max open fds: 176, Max openfd time: 2019-10-03 08:41:23.213966
Count          filename
=======================
17497         /logs/mg1.hpc.domain.fr/islet55/messages
14205         /logs/mg1.hpc.domain.fr/islet54/messages
9909          /logs/mg1.hpc.domain.fr/islet12/messages
9406          /logs/mg1.hpc.domain.fr/worker3/messages
7486          /logs/mg1.hpc.domain.fr/top1/messages
6112          /PMSM_namespaces/islet54/log/pmsm.debug
[...]
# gluster volume top volspoms2 open list-cnt 10 brick worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Brick: worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Current open fds: 44, Max open fds: 111, Max openfd time: 2019-10-03 08:41:23.203716
Count      filename
=======================
15417     /logs/mg1.hpc.domain.fr/worker1/messages
13698     /logs/mg1.hpc.domain.fr/worker2/messages
12663     /logs/mg1.hpc.domain.fr/infra2/messages
8559      /logs/mg1.hpc.domain.fr/top2/messages
8489      /logs/mg1.hpc.domain.fr/top3/messages
7933      /logs/mg1.hpc.domain.fr/infra1/messages
6487      /tftp_data/poap.py
4844      /logs/mg1.hpc.domain.fr/worker3/cron
4821      /logs/mg1.hpc.domain.fr/ns3/cron
4770      /logs/mg1.hpc.domain.fr/ns2/cron
View the highest file write calls (overall or per brick):
# gluster volume top volspoms1 write list-cnt 20 brick top1-data.mg1.hpc.domain.fr:/gluster/brick1/data
Brick: top1-data.mg1.hpc.domain.fr:/gluster/brick1/data
Count   filename
=======================
15530   <gfid:7bc6ea80-d4ce-43e1-a1e0-1899538fbd3c>
12139   /.shard/fee4016d-8773-47d1-a901-9148db080328.66
11407   <gfid:be318638-e8a0-4c6d-977d-7a937aa84806>/9a10eb4c-3ab9-42fa-a210-ff01ff95b599.1
9829    <gfid:971837b5-846f-4111-bb67-73bab39311c2>
8334    /.shard/995bde66-947a-433d-b436-270b5259386e.42
7392    /pcocc/persistent_drives/i54conf1.qcow2
[...]

Similar commands are available for read, opendir and readdir operations.

You can also clear the counters with the gluster volume top volspoms2 clear command.
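
For instance, a sketch of inspecting the top readers of a single brick and then resetting the counters (brick path taken from the examples above):

# gluster volume top volspoms2 read list-cnt 10 brick worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
# gluster volume top volspoms2 clear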

Statedump

GlusterFS state dumps contain memory usage, I/O buffer details, translator information, pending calls, open fds and inode tables.

To take and read a state dump of a given server, use the gluster volume statedump VOLUME command:

# gluster volume statedump volspoms1
volume statedump: success
# gluster volume get volspoms1 server.statedump-path
Option                                  Value
------                                  -----
server.statedump-path                   /var/run/gluster
# ls /var/run/gluster/gluster-brick*
/var/run/gluster/gluster-brick1-data.46356.dump.1570201474
/var/run/gluster/gluster-brick2-data.46364.dump.1570201475
/var/run/gluster/gluster-brick3-data.46371.dump.1570201476

State dumps can be filtered by appending an argument indicating the information to be dumped (all|mem|iobuf|callpool|priv|fd|inode|history):

# gluster volume statedump volspoms1 callpool
volume statedump: success
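
To quickly locate the latest dumps and get an overview of their content, something like the following can be used (the bracketed section headers are the usual statedump layout):

# ls -t /var/run/gluster/*.dump.* | head -n 3
# grep '^\[' /var/run/gluster/gluster-brick1-data.46356.dump.1570201474 | sort | uniq -c | sort -rn | head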

Volume management

Start/Stop

Starts and stops are done using the corresponding gluster commands:

# gluster volume stop volspoms2
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Stopping volume volspoms2 has been successful
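
The corresponding start command is symmetrical (the exact success message depends on the GlusterFS version):

# gluster volume start volspoms2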

Self healing

A pro-active self-heal daemon diagnoses issues and initiates healing every 10 minutes on the files that require it. A file needs healing when there are unsynced entries between its replicated copies. volume heal commands are not really costly and can be launched at any time.

You can view the list of files that need healing, the list of files which are currently or were previously healed, the list of files in a split-brain state, and you can manually trigger a self-heal on the entire volume or only on the files which need healing.

Trigger a self-heal on the files requiring healing:
# gluster volume heal volspoms1
Heal operation on volume volspoms1 has been successful
Trigger a self-heal on all the files of a volume:
# gluster volume heal volspoms1 full
Heal operation on volume volspoms1 has been successful
View the list of files that need healing:
# gluster volume heal volspoms2 info
Brick top1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries: 0

Brick top2-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries: 0

Brick top3-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries: 0

Brick worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries: 0

Brick worker2-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries: 0

Brick worker3-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries: 0
View the list of files of a particular volume which are in a split-brain state:
# gluster volume heal volspoms2 info split-brain
Brick top1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries in split-brain: 0

Brick top2-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries in split-brain: 0

Brick top3-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries in split-brain: 0

Brick worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries in split-brain: 0

Brick worker2-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries in split-brain: 0

Brick worker3-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries in split-brain: 0

Logs

Servers and FUSE clients store their logs in /var/log/glusterfs by default.

FUSE clients

FUSE clients create one log file per mount point; its location can be changed with the log-file mount option. The name of the log file corresponds to the GlusterFS volume name.
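
For example, a sketch of a mount that redirects its client log (mount point and log path are arbitrary choices):

# mount -t glusterfs -o log-file=/var/log/glusterfs/volspoms2-client.log top1-data.mg1.hpc.domain.fr:/volspoms2 /mnt/volspoms2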

Servers

Servers create a separate log file for each component:

  • CLI: /var/log/glusterfs/cli.log

  • glusterd: /var/log/glusterfs/glusterd.log

  • Healing daemons: /var/log/glusterfs/glfsheal-VOLUME.log, /var/log/glusterfs/glustershd.log
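
If log verbosity needs tuning, the brick and client log levels can usually be adjusted per volume (option names from upstream GlusterFS; check them against the deployed version):

# gluster volume set volspoms2 diagnostics.brick-log-level WARNING
# gluster volume set volspoms2 diagnostics.client-log-level WARNING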