GlusterFS operations¶
GlusterFS Architecture¶
Ocean’s GlusterFS deployments are composed of N groups of 3 servers. Each group can be seen as an availability cell with a redundancy of 1, which means that only one member of each group can be lost.
Each server has 4 data bricks on a single RAID 10 block device. GlusterFS bricks, however, have to be separate block devices, so the RAID array is split using LVM thin pools as follows:
+-------------------+    +-----------------+
|RAID10 Array (N Gb)|-+--|Brick 1 ( N/2 Gb)|
+-------------------+ |  +-----------------+
                      |  +-----------------+
                      +--|Brick 2 ( N/2 Gb)|
                      |  +-----------------+
                      |  +-----------------+
                      +--|Brick 3 ( N/2 Gb)|
                      |  +-----------------+
                      |  +-----------------+
                      +--|Brick 4 ( N/2 Gb)|
                         +-----------------+
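For reference, a layout like this can be built with standard LVM thin-provisioning commands. The sketch below is only an illustration: /dev/md0 is an assumed placeholder for the RAID 10 array and the sizes are examples, while the VG, pool and brick names match the lvs output shown below.

# Sketch only: /dev/md0 stands for the RAID 10 array device (assumption).
pvcreate /dev/md0
vgcreate gluster /dev/md0

# One thin pool spanning the whole array.
lvcreate --type thin-pool -l 100%FREE -n thin_pool gluster

# Overcommitted thin volumes carved out of the pool, one per brick.
lvcreate -V 1.25T --thin -n brick1 gluster/thin_pool
lvcreate -V 1.25T --thin -n brick2 gluster/thin_pool
lvcreate -V 1.25T --thin -n brick3 gluster/thin_pool
lvcreate -V 2.50T --thin -n brick4 gluster/thin_pool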
As the diagram shows, the brick volumes are overcommitted: their total size is twice the capacity of the raw device. This means that the filling of the LVM thin pool has to be monitored.
This can be done using the lvs command:
# lvs gluster
  LV        VG      Attr       LSize  Pool      Origin Data%  Meta%  Move Log Cpy%Sync Convert
  brick1    gluster Vwi-aot--- 1.25t  thin_pool        27.01
  brick2    gluster Vwi-aot--- 1.25t  thin_pool        27.74
  brick3    gluster Vwi-aot--- 1.25t  thin_pool         0.06
  brick4    gluster Vwi-aot--- 2.50t  thin_pool         0.40
  thin_pool gluster twi-aot--- <2.59t                  26.87  0.65
The preceding output shows that the overall pool is about 27% full (thin_pool LV) and that the most used individual bricks are also at about 27% (brick* LVs).
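If this check needs to be automated (cron job, monitoring probe), the Data% value can be extracted directly. The following is a minimal sketch, assuming the VG/LV names shown above and an arbitrary 80% threshold:

#!/bin/bash
# Minimal sketch: warn when the gluster thin pool is more than 80% full.
# The 80% threshold is an arbitrary example; gluster/thin_pool matches the lvs output above.
THRESHOLD=80
USAGE=$(lvs --noheadings -o data_percent gluster/thin_pool | tr -d ' ')
# data_percent is reported as e.g. "26.87"; compare the integer part.
if [ "${USAGE%.*}" -ge "$THRESHOLD" ]; then
    echo "WARNING: gluster/thin_pool is ${USAGE}% full" >&2
fi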
Ocean’s GlusterFS deployment provides 2 independent volumes with different setups:
``volspoms1/voldata``
* Distributed, replicated once, with 1 arbiter brick and 64MB shards
* 3 bricks per server: 2 data bricks and 1 arbiter brick (a brick that doesn’t consume space but guarantees consistency)
* Volumes are set up across the cluster availability cells in a circular fashion:
+------------------CELL-----------------+ +-----------------CELL-----------------+
| +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
| | Server 1 | | Server 2 | | Server 3 || || Server 4 | | Server 5 | | Server 6 ||
| +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
| +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
| | Brick 1 | | Brick 1 | | Arbiter || || Brick 1 | | Brick 1 | | Arbiter ||
| +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
| +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
| | Brick 2 | | Arbiter | | Brick 2 || || Brick 2 | | Arbiter | | Brick 2 ||
| +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
| +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
| | Arbiter  | | Brick 3  | | Brick 3  || || Arbiter  | | Brick 3  | | Brick 3  ||
| +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
+---------------------------------------+ +--------------------------------------+
* Options of the `GlusterFS` ``virt`` option group are applied
* Pros:
* This volume is well-suited for files bigger than 100MB
* Arbiter volume guarantees consistency without the space usage of a third copy of the data
* Cons:
* The cost of reconstruction of this volume after an incident will grow with the number of files
* Only one redundant copy of the data
* Usage: VM images, RPM repos, Pcocc repo, diskless images, ...
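For reference, the shard size and the virt option group mentioned above can be applied or checked with gluster volume set/get. The following is a sketch, not the exact commands used at deployment time:

# Sketch only: these options are normally already set on the production volume.
# Apply the predefined "virt" option group.
gluster volume set volspoms1 group virt

# Enable sharding with 64MB shards.
gluster volume set volspoms1 features.shard on
gluster volume set volspoms1 features.shard-block-size 64MB

# Verify the current value of an option.
gluster volume get volspoms1 features.shard-block-size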
``volspoms2/volconf``
* Distributed, Replicated 2 times
* 1 brick per server
* Volumes are set up across the cluster availability cells naturally:
+------------------CELL-----------------+ +-----------------CELL-----------------+
| +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
| | Server 1 | | Server 2 | | Server 3 || || Server 4 | | Server 5 | | Server 6 ||
| +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
| +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
| | Brick 4 | | Brick 4 | | Brick 4 || || Brick 4 | | Brick 4 | | Brick 4 ||
| +----------+ +----------+ +----------+| |+----------+ +----------+ +----------+|
+---------------------------------------+ +--------------------------------------+
* Pros:
* This volume is well-suited for small files
* 3 copies of the data
* Simple set-up
* Cons:
* Consumes lots of space
* Usage: Everything else that should be shared
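For reference, a volume with this layout (one brick per server, 3 copies of the data) would be created along these lines. This is a sketch that reuses the brick paths visible in the status output below, not the exact command used at deployment time:

# Sketch: 6 bricks with replica 3 -> 2 replica sets, one per availability cell.
gluster volume create volspoms2 replica 3 \
    top1-data.mg1.hpc.domain.fr:/gluster/brick4/data \
    top2-data.mg1.hpc.domain.fr:/gluster/brick4/data \
    top3-data.mg1.hpc.domain.fr:/gluster/brick4/data \
    worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data \
    worker2-data.mg1.hpc.domain.fr:/gluster/brick4/data \
    worker3-data.mg1.hpc.domain.fr:/gluster/brick4/data
gluster volume start volspoms2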
Operations¶
Status¶
To monitor a volume, you can use the gluster volume status command:
# gluster volume status volspoms2
Status of volume: volspoms2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick top1-data.mg1.hpc.domain.fr:/gluste
r/brick4/data                               49152     0          Y       69912
Brick top2-data.mg1.hpc.domain.fr:/gluste
r/brick4/data                               49155     0          Y       40746
Brick top3-data.mg1.hpc.domain.fr:/gluste
r/brick4/data                               49155     0          Y       187494
Brick worker1-data.mg1.hpc.domain.fr:/glu
ster/brick4/data                            49155     0          Y       87455
Brick worker2-data.mg1.hpc.domain.fr:/glu
ster/brick4/data                            49155     0          Y       144119
Brick worker3-data.mg1.hpc.domain.fr:/glu
ster/brick4/data                            49155     0          Y       85702
Self-heal Daemon on localhost               N/A       N/A        Y       64412
Self-heal Daemon on worker3-data.mg1.hpc.
domain.fr                                   N/A       N/A        Y       63672
Self-heal Daemon on top2-data.mg1.hpc.
domain.fr                                   N/A       N/A        Y       224322
Self-heal Daemon on top3-data.mg1.hpc.
domain.fr                                   N/A       N/A        Y       60817
Self-heal Daemon on worker2-data.mg1.hpc.
domain.fr                                   N/A       N/A        Y       176211
Self-heal Daemon on worker1-data.mg1.hpc.
domain.fr                                   N/A       N/A        Y       4036

Task Status of Volume volspoms2
------------------------------------------------------------------------------
There are no active volume tasks
This shows all GlusterFS daemons and their statuses (Online column). Those daemons are launched as systemd services:
# systemctl status glusterd
glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2019-08-28 16:25:20 CEST; 1 months 6 days ago
Main PID: 69843 (glusterd)
CGroup: /system.slice/glusterd.service
├─46356 /usr/sbin/glusterfsd -s top1-data.mg1.hpc.domain.fr --volfile-id volspoms1.top1-data.mg1.domain...
├─46364 /usr/sbin/glusterfsd -s top1-data.mg1.hpc.domain.fr --volfile-id volspoms1.top1-data.mg1.domain...
├─46371 /usr/sbin/glusterfsd -s top1-data.mg1.hpc.domain.fr --volfile-id volspoms1.top1-data.mg1.domain...
├─64412 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/..
├─69843 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
└─69912 /usr/sbin/glusterfsd -s top1-data.mg1.hpc.domain.fr --volfile-id volspoms2.top1-data.mg1.domain...
glusterfsd processes are the brick daemons, while glusterfs processes are the self-heal daemons.
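To quickly map the processes running on a server to their roles, the standard process tools are enough (a trivial sketch):

# Brick daemons: one glusterfsd process per brick of each started volume.
ps -C glusterfsd -o pid,args
# Management daemon (glusterd) and self-heal daemon (glusterfs).
ps -C glusterfs,glusterd -o pid,args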
Self-heal daemons can be enabled or disabled using gluster commands:
# gluster volume heal volspoms2 disable
Disable heal on volume volspoms2 has been successful
# gluster volume heal volspoms2 enable
Enable heal on volume volspoms2 has been successful
Workload¶
Monitoring the GlusterFS workload can be done with the top and profile commands. You can also take a state dump of the brick processes.
Profile¶
Profiling must first be enabled on the volume before any information can be collected:
# gluster volume profile volspoms2 info
Profile on Volume volspoms2 is not started
# gluster volume profile volspoms2 start
Starting volume profile on volspoms2 has been successful
# gluster volume profile volspoms2 info
[...]
# gluster volume profile volspoms2 stop
Stopping volume profile on volspoms2 has been successful
Then, using the profile info command, you can retrieve the collected data. It shows the following information per brick:
* Cumulative statistics
    * Per block size read/write counts
    * Per file operation latency statistics and number of calls
    * An overall bytes read/written counter for the given interval
* Interval statistics
    * Per block size read/write counts
    * Per file operation latency statistics and number of calls
    * An overall bytes read/written counter for the given interval
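Depending on the GlusterFS version, the info subcommand also accepts qualifiers to display only one of the two sets of statistics, or to reset them (a sketch):

# Cumulative statistics only (since profiling was started).
gluster volume profile volspoms2 info cumulative
# Interval statistics only (since the previous "info" call).
gluster volume profile volspoms2 info incremental
# Reset the profiling counters without stopping profiling.
gluster volume profile volspoms2 info clear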
Top¶
Using the gluster volume top command you can:
- List the 10 most opened files on all bricks or on a specific brick

# gluster volume top volspoms2 open list-cnt 10
Brick: top1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Current open fds: 54, Max open fds: 176, Max openfd time: 2019-10-03 08:41:23.213966
Count     filename
=======================
17497     /logs/mg1.hpc.domain.fr/islet55/messages
14205     /logs/mg1.hpc.domain.fr/islet54/messages
9909      /logs/mg1.hpc.domain.fr/islet12/messages
9406      /logs/mg1.hpc.domain.fr/worker3/messages
7486      /logs/mg1.hpc.domain.fr/top1/messages
6112      /PMSM_namespaces/islet54/log/pmsm.debug
[...]

# gluster volume top volspoms2 open list-cnt 10 brick worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Brick: worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Current open fds: 44, Max open fds: 111, Max openfd time: 2019-10-03 08:41:23.203716
Count     filename
=======================
15417     /logs/mg1.hpc.domain.fr/worker1/messages
13698     /logs/mg1.hpc.domain.fr/worker2/messages
12663     /logs/mg1.hpc.domain.fr/infra2/messages
8559      /logs/mg1.hpc.domain.fr/top2/messages
8489      /logs/mg1.hpc.domain.fr/top3/messages
7933      /logs/mg1.hpc.domain.fr/infra1/messages
6487      /tftp_data/poap.py
4844      /logs/mg1.hpc.domain.fr/worker3/cron
4821      /logs/mg1.hpc.domain.fr/ns3/cron
4770      /logs/mg1.hpc.domain.fr/ns2/cron

- View highest file write calls (overall or per brick)
# gluster volume top volspoms1 write list-cnt 20 brick top1-data.mg1.hpc.domain.fr:/gluster/brick1/data
Brick: top1-data.mg1.hpc.domain.fr:/gluster/brick1/data
Count     filename
=======================
15530     <gfid:7bc6ea80-d4ce-43e1-a1e0-1899538fbd3c>
12139     /.shard/fee4016d-8773-47d1-a901-9148db080328.66
11407     <gfid:be318638-e8a0-4c6d-977d-7a937aa84806>/9a10eb4c-3ab9-42fa-a210-ff01ff95b599.1
9829      <gfid:971837b5-846f-4111-bb67-73bab39311c2>
8334      /.shard/995bde66-947a-433d-b436-270b5259386e.42
7392      /pcocc/persistent_drives/i54conf1.qcow2
[...]
Similar commands are available for read, opendir and readdir operations.
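For example, the read variants follow the same pattern (a sketch):

# 10 most read files, across all bricks or for a single brick.
gluster volume top volspoms2 read list-cnt 10
gluster volume top volspoms2 readdir list-cnt 10 brick worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data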
You can also clear the counters with the gluster volume top volspoms2 clear command.
Statedump¶
GlusterFS state dumps contain memory usage, I/O buffer details, translator information, pending calls, open fds and inode tables.
To take and read a state dump of a given server, use the gluster volume statedump VOLUME command:
# gluster volume statedump volspoms1
volume statedump: success
# gluster volume get volspoms1 server.statedump-path
Option                                  Value
------                                  -----
server.statedump-path                   /var/run/gluster
# ls /var/run/gluster/gluster-brick*
/var/run/gluster/gluster-brick1-data.46356.dump.1570201474
/var/run/gluster/gluster-brick2-data.46364.dump.1570201475
/var/run/gluster/gluster-brick3-data.46371.dump.1570201476
State dumps can be filtered by appending an argument indicating the information to be dumped (all|mem|iobuf|callpool|priv|fd|inode|history):
# gluster volume statedump volspoms1 callpool
volume statedump: success
Volume management¶
Start/Stop¶
Volumes are started and stopped using the corresponding gluster commands:
# gluster volume stop volspoms2
Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
Stopping volume volspoms2 has been successful
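Restarting the volume is the symmetric operation (shown here without its output, as a sketch):

# Bring the volume back online after maintenance.
gluster volume start volspoms2

# If some brick processes did not come back, force a restart of the missing ones.
gluster volume start volspoms2 force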
Self healing¶
A pro-active self-healing daemon diagnoses issues and initiates self-healing every 10 minutes on the files requiring healing. A file needing healing means that there are unsynced entries between the replicated parts of the file. volume heal commands are not costly and can be launched at any time.
You can view the list of files that need healing, the list of files which are currently being or have previously been healed, and the list of files which are in a split-brain state; you can also manually trigger a self-heal on the entire volume or only on the files which need healing.
- Trigger a self-heal on files requiring healing
# gluster volume heal volspoms1
Heal operation on volume volspoms1 has been successful
- Trigger a self-heal on all the files of a volume.
# gluster volume heal volspoms1 full
Heal operation on volume volspoms1 has been successful
- View the list of files that need healing
# gluster volume heal volspoms2 info
Brick top1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries: 0

Brick top2-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries: 0

Brick top3-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries: 0

Brick worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries: 0

Brick worker2-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries: 0

Brick worker3-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries: 0

- View the list of files of a particular volume which are in a split-brain state
# gluster volume heal volspoms2 info split-brain
Brick top1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries in split-brain: 0

Brick top2-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries in split-brain: 0

Brick top3-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries in split-brain: 0

Brick worker1-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries in split-brain: 0

Brick worker2-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries in split-brain: 0

Brick worker3-data.mg1.hpc.domain.fr:/gluster/brick4/data
Status: Connected
Number of entries in split-brain: 0
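If entries do show up in split-brain, they can be resolved with the dedicated heal subcommands by choosing which copy wins. The following is a sketch; the file path used here is only a hypothetical example:

# Keep the copy with the most recent modification time for one file (hypothetical path).
gluster volume heal volspoms2 split-brain latest-mtime /logs/mg1.hpc.domain.fr/top1/messages
# Or keep everything from a chosen brick as the healing source.
gluster volume heal volspoms2 split-brain source-brick top1-data.mg1.hpc.domain.fr:/gluster/brick4/data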
Logs¶
Servers and FUSE clients store their logs in /var/log/glusterfs by default.
- FUSE clients
Create one log file per mount point. The log file location can be changed with the log-file mount option. The name of the log file corresponds to the GlusterFS volume name.
- Servers
Create a separate log file for each component:
CLI:
/var/log/glusterfs/cli.log
glusterd:
/var/log/glusterfs/glusterd.log
Healing daemons:
/var/log/glusterfs/glfsheal-VOLUME.log, /var/log/glusterfs/glustershd.log
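As an illustration, a FUSE client mount with a custom log location and verbosity could look like this. This is a sketch: the server name, log path and mount point are placeholders.

# Hypothetical client mount: redirect logs and raise verbosity for debugging.
mount -t glusterfs \
    -o log-file=/var/log/glusterfs/volspoms2-debug.log,log-level=DEBUG \
    top1-data.mg1.hpc.domain.fr:/volspoms2 /mnt/volspoms2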