Installation methodology¶
Todo
Offsite, Onsite and Production phases (original message)
Offsite, Onsite and production phases¶
The Offsite phase includes:
taking into account the physical layout of the cluster in the machine room
the integration of the Manufacturer data corresponding to the hardware to install and configure
the setup of a platform and software solution in charge of recording and maintaining the above information:
a machine or VM hosting the Ocean software stack
innovative software packages to:
quickly integrate structural changes of the cluster
interface the data from different manufacturers into a unified, standardized YAML format
ease day-to-day operation and the maintenance in operational condition
the creation of a DVD or USB medium bootstrapping the deployment of the first node of the cluster
the configuration of the network and the addressing plan
virtualization as it will be used in this software stack
the configuration and organization of the storage and the associated software layers
The Onsite phase is the initial installation of all the elements making up the cluster. It describes all the steps needed to set up the node subsets and the network infrastructure required for the whole system to operate properly. This phase also includes the integration of the platform described above.
The Production phase is the operation of what was installed during the Onsite phase. Together with the 'Administration', 'HandBooks' and 'CookBooks' sections, it allows keeping the cluster in operational condition. It naturally integrates the administration and 'Monitoring' tools as well as the self-corrective automation required to ensure high availability.
Training on the tools and on the functional arrangements is an essential element of collaborative work between the teams.
Offsite preparation¶
Manufacturer data¶
This step gathers and cross-references the following pieces of information:
the physical location (machine room topology), provided by the Customer
the physical composition of the cluster (provided by the Manufacturer):
the number of 'rack' elements and their height
the subsets making up each rack
the connectivity capabilities of each subset
the links between the connectors of the different subsets
the MAC addresses of part of the equipment (top and worker), and, for the rest, support for option 82 (DHCP)
All these elements will be consolidated in a database. This database, initially external to the cluster, will be re-imported during the Onsite phase. It will be used throughout the life of the cluster, among other things for hardware interventions. The Manufacturer data is provided as a 'netlist'. An explanation of this netlist, or even a change in its presentation, may be requested from the Manufacturer.
Platform and software for structural knowledge¶
Setting up the Ocean tools that manage the environment and its lifecycle requires a dedicated system.
Using a VM gives more flexibility, mobility and safety (backup of a single qcow file).
VM SitePrep¶
Installation and configuration of Pcocc for a siteprep VM
Todo
Retrieve the notes from the setup of the mtest VM
Installation of RackTables, hwdb and netcc
RackTables:
Installation
yum install -y RackTables mariadb-server mariadb
Apache integration
mkdir /var/www/html/racktables
ln -s /usr/share/RackTables/wwwroot/index.php /var/www/html/racktables
Starting services and following the instructions displayed on (%HOST%/racktables)
systemctl start httpd mariadb
# Step 1: login and password are described in secret.php
touch '/etc/RackTables/secret.php'; chmod a=rw '/etc/RackTables/secret.php'
# Step 3
mysql << EOF
CREATE DATABASE racktables_db CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE USER racktables_user@localhost IDENTIFIED BY 'MY_SECRET_PASSWORD';
GRANT ALL PRIVILEGES ON racktables_db.* TO racktables_user@localhost;
EOF
# Step 4
chmod 440 /etc/RackTables/secret.php
chown apache:apache /etc/RackTables/secret.php
hwdb (inside package confiture):
Row insertion:
hwdb obj add -t Row A
hwdb obj list
Rack by row:
hwdb obj add -t Rack --container A A[3-7]
hwdb obj list
Restore types, models, ports and compatibilities
Todo
Make the restore options consistent!
hwdb port type restore rt_dumps/ptypes.dump
hwdb port compat restore --csv rt_dumps/pcompat.dump
hwdb obj model restore rt_dumps/models.dump
Insertion of the Sequana2
hwdb cell add --rack A6 --prefix s20 templates/sequana2.hw.yaml
hwdb cell add --rack A7 --prefix s22 templates/sequana2.hw.yaml
Insert switches, disk arrays, top and worker (example)
# inserts 2 servers (2U) at base level 15 and 17 for rack A4
hwdb obj add -t server --container A4 --label top --slots 15,17 --size 2 top[1,2]
hwdb obj update --model "SuperMicro 2U" x430 top[1-3]
hwdb obj update --model "SuperMicro 2U" x430 worker[1-3]
# inserts 2 x nexus 9364c in rack A3 at base level 6 and 9
hwdb obj add -t 'network switch' --container A3 --label nexus-9364c \
    --slots 6,9 --size 2 esw[1-2]
# inserts 4 x 3650 in rack A4 between level 6 and 9
hwdb obj update --model "Nexus" 9364c esw[1-2]
hwdb obj add -t 'network switch' --container A4 --label 3650 \
    --slots 6,7,8,9 --size 1 esw[3-6]
hwdb obj update --model "Cisco" 3650 esw[3-6]
# inserts jbod in rack A5 at base level 6 and 9
hwdb obj add -t DiskArray --container A6 --label jbod-r6 \
    --slots 3 --size 2 --model "SuperMicro 2U" x430 yyy
# Insert Colddoor
hwdb obj add -t PDU --container A3 --label cooldoor --slots 1 --size 1 --subcontainer rear i0r0cooldoor0
Insert links
hwdb port add --label master -t hardwired 1000Base-T node180 Ethernet1
hwdb port add --label slave -t hardwired 1000Base-T node180 Ethernet2
# And links
hwdb port link i10esw1 Ethernet4 node180 Ethernet1
hwdb port link i10esw2 Ethernet4 node180 Ethernet2
hwdb port update --label 'master opt82 shared=BMC nolag' node180 Ethernet1
hwdb port update --label 'slave opt82 shared=BMCslave nolag' node180 Ethernet2
# hwdb port compat add --from 1000Base-T --to 'empty SFP+'
# Uplinks
hwdb port add -t QSFP+ 'empty QSFP' esw[1-2] Ethernet[1-48]
hwdb link esw1 Ethernet[1-4] esw2 Ethernet[1-4]
hwdb port link i10esw1 Ethernet53 esw1 Ethernet5
hwdb port link i10esw2 Ethernet53 esw2 Ethernet5
# lags and speed
hwdb port update --label 'speed=40000 lag=vpc10' i10esw[1-2] Ethernet53
hwdb port update --label 'speed=40000 lag=vpc10' esw[1-2] Ethernet5
A tool which reads the Provider netlist and converts all entries into hwdb commands
Note
A structured CSV file, containing only the descriptions usable in a specific sheet, could be provided to reduce the disparities between the different manufacturers and allow a simplified generation of hwdb commands.
Todo
Write a specification of the requirements related to netcc
Installation and configuration of confiture
Installation
yum install -y confiture git emacs-nox vim vim-enhanced
Bootstrap confiture
git init cluster
cp -aR /usr/share/doc/confiture*/examples/* cluster/
Configure the URL to the DB in confiture.yaml; the paths are relative to the location of confiture.yaml.
# Starting configuration (/path/confiture/confiture.yaml)
common:
  hiera_conf: hiera.yaml
  template_dir: templates/
  output_dir: output/
dhcp:
  conf_name: dhcpd.conf
dns:
  conf_name: named.conf
racktables:
  url: 'mysql://racktables_user:MY_SECRET_PASSWORD@localhost/racktables_db'
Confiture Network Range
In the network.yaml file, we define the subnets associated with each type of equipment:
1 network bbone for backbone access of the top and worker nodes: A.B.C.0/24
1 network eq for access to and monitoring of the equipment: E.4.0.0/23
1 network adm for admin access of the management nodes: E.1.0.0/24
1 network data for access to the data in glusterfs: E.5.0.0/24
1 network ipmi for admin access of the management nodes: E.4.0.0/24
Todo
Check the network definitions
networks:
  # TOP Bbone network
  bbone:
    range: A.B.C.0/24
    interface: 'enp130s0f0'
    nameservers:
      - "${address('top1-bone')}"
    tftpservers:
      - "${address('top1-bone')}"
    bmgrservers:
      - "${address('top1-bone')}"
  # Vlan 1 ?
  eq:
    range: X.0.0.0/23
    interface: 'eno2'
    nameservers:
      - "${address('top1-eq')}"
    ntpservers:
      - "${address('top1-eq')}"
    tftpservers:
      - "${address('top1-eq')}"
    bmgrservers:
      - "${address('top1-eq')}"
  # Administration network
  # Vlan 1 ?
  adm:
    range: X.1.0.0/24
    interface: 'ens1'
    bmgrservers:
      - "${address('top1-adm')}"
    nameservers:
      - "${address('top1-adm')}"
    ntpservers:
      - "${address('top1-adm')}"
    tftpservers:
      - "${address('top1-adm')}"
  # Vlan 1 ?
  data:
    range: X.5.0.0/24
    interface: 'enp130s0f0'
    bmgrservers:
      - "${address('top1-adm')}"
    nameservers:
      - "${address('top1-adm')}"
    ntpservers:
      - "${address('top1-adm')}"
    tftpservers:
      - "${address('top1-adm')}"
  # BMC: physical network
  # Vlan 104 ?
  ipmi:
    range: X.4.0.0/24
    interface: 'enp130s0f0'
    bmgrservers:
      - "${address('top1-adm')}"
    nameservers:
      - "${address('top1-adm')}"
    ntpservers:
      - "${address('top1-adm')}"
    tftpservers:
      - "${address('top1-adm')}"
[...]
In the addresses.yaml file, IPs are associated with the networks above. Example:
addresses:
  top[1-3]:
    default: [adm,eq,bone,data,ipmi]
    bbone: A.B.C.[1-3]
    eq: X.0.0.[1-3]
    adm: X.1.0.[1-3]
    data: X.5.0.[1-3]
    ipmi: A.B.C.[128-130]
  worker[1-3]:
    default: [adm,eq,bone,data,ipmi]
    bbone: A.B.C.[4-6]
    eq: X.0.0.[4-6]
    adm: X.1.0.[4-6]
    data: X.5.0.[4-6]
    ipmi: A.4.0.[128-130]
  esw[1-2]:
    default: [adm]
    adm: A.4.$(islet-id).[1-2]
  esw[4-6]:
    default: [adm]
    adm: A.4.$(islet-id).[4-6]
Todo
To be completed with the switches and the islets?
Installation media preparation¶
On-site installation will require a traditional installation using an external installation medium. To guarantee that this medium's content matches what we intend to install, we will generate it ourselves. To do this, we must have a node (virtual or not) running the OS we want to install.
Here, we will use the latest cloud-ocean pcocc image available. Other means can be used to launch the very same image (VirtualBox, libvirt, …).
Boot image¶
The first step is to generate a boot image using lorax. This image includes a minimal OS and the anaconda installer. No other content (RPMs for instance) is included.
Install lorax:
yum install -y lorax
Generation currently requires the CentOS-os, CentOS-updates, CentOS-extras and Ocean repositories. Collect the required repository URLs:
yum repolist -v | grep baseurl
Launch lorax generation:
# Ocean major.minor version (2.x)
oswanted=2.6
# URL Ocean repo
yumsrv="http://pkg/mirror/pub/linux/ocean/"
lorax --isfinal -p Ocean -v ${oswanted} -r 1 \
-s ${yumsrv}/${oswanted}/ocean/x86_64 \
-s ${yumsrv}/${oswanted}/centos-os/x86_64 \
-s ${yumsrv}/${oswanted}/centos-updates/x86_64 \
-s ${yumsrv}/${oswanted}/centos-extras/x86_64 \
-s ${yumsrv}/${oswanted}/epel/x86_64 \
-s ${yumsrv}/${oswanted}/ocean/x86_64 \
-s ${yumsrv}/${oswanted}/greyzone/x86_64 \
/tmp/lorax_image
Installation repos¶
Now we have to include some content into the generated image. First, gather all the packages that might be required during the kickstart using yum:
mkdir -p /tmp/ocean_media/Packages/
yum install -y --installroot=/tmp/ocean_media/Packages/ --downloadonly --downloaddir=/tmp/ocean_media/Packages/ @core @base @anaconda-tools anaconda puppet puppet4 bridge-utils lsof minicom strace tcpdump vim emacs-nox bind-utils crash yum-utils
rm -Rf /tmp/ocean_media/Packages/var
If any other package is required it should be included here.
Recreate the yum groups using CentOS's comps.xml:
createrepo -g /dev/shm/packages/ocean_centos/comps.xml /tmp/ocean_media/
Note
CentOS comps.xml is available here : http://mirror.centos.org/centos/7/os/x86_64/repodata/aced7d22b338fdf7c0a71ffcf32614e058f4422c42476d1f4b9e9364d567702f-c7-x86_64-comps.xml
Media metadata¶
Mount and copy the content of the generated boot image:
mkdir /mnt/lorax_image /tmp/lorax_image_content
mount -o loop /tmp/lorax_image/images/boot.iso /mnt/lorax_image
rsync -avr /mnt/lorax_image/ /tmp/lorax_image_content
rm /tmp/lorax_image_content/isolinux/boot.cat
And now that we have all the bits to make the media, assemble everything:
mkisofs -o /tmp/ocean.iso -b isolinux/isolinux.bin -c isolinux/boot.cat -boot-load-size 4 -boot-info-table -no-emul-boot -eltorito-alt-boot -e images/efiboot.img -no-emul-boot -R -V "Ocean ${oswanted} x86_64" -T -graft-points isolinux=/tmp/lorax_image_content/isolinux images/pxeboot=/tmp/lorax_image_content/images/pxeboot LiveOS=/tmp/lorax_image_content/LiveOS EFI/BOOT=/tmp/lorax_image_content/EFI/BOOT images/efiboot.img=/tmp/lorax_image_content/images/efiboot.img .discinfo=/tmp/lorax_image/.discinfo .treeinfo=/tmp/lorax_image/.treeinfo Packages=/tmp/ocean_media/Packages repodata=/tmp/ocean_media/repodata
isohybrid --uefi /tmp/ocean.iso
implantisomd5 /tmp/ocean.iso
checkisomd5 /tmp/ocean.iso
Finally, try it out on a machine with qemu installed and X11 access:
qemu-system-x86_64 -m 1024 -smp 1 -cdrom ./ocean.iso
When validated, burn it to a DVD or to USB storage.
RAID Configuration on top and worker¶
Using a console plugged into each node, create:
1 RAID1 named 'system' with the first two drives
1 RAID10 named ‘data’ with all other drives
Initialize all RAID drives
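Once the node has booted on the installed system, the arrays should show up as regular block devices. A quick sanity check (device names such as Volume0_0 or sdc will vary with your RAID controller):
lsblk -o NAME,SIZE,TYPE,MODEL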
Network definition¶
Todo
Guide on how to do the RackTables insertion. VLAN & IP partitioning design. The result should be configuration files generated by confiture (DNS, DHCP, Switches).
Vlan and IP design guide¶
In an Ocean Stack cluster, the first requirement is that each islet must be in an independent set of VLANs. This is a requisite for three reasons: scalability, reliability and ease of management.
This is because a cluster evolves. Adding or removing nodes must not affect the operational state of the cluster.
Because of this, the Ethernet fabric design should be able to route between those VLANs in an effective way. The current best practice (documented in the N9K L3 Fabric architecture) uses an L3 fabric and the BGP protocol to dynamically route IP traffic between islets.
A second requirement is a clear separation of IP subnets depending on node or equipment types. For example, in a compute islet, compute nodes and their related BMCs should be in separate IP subnets. The same should be done for administrative nodes versus service IPs.
A best practice is a hierarchical allocation of IP subnets that respects CIDR boundaries. This makes the design of ACLs easier, for example having all the administrative allocations in the first "/13" subnet and all the service nodes in the second "/13" subnet.
An example of IP allocation could be:
And with a VLAN mapping that ensures no equipment can spoof equipment of another type:
VLAN | IP Subnet
A | 10.1.0.0/24, 10.3.0.0/24
B | 10.8.0.0/24
C | 10.1.10.0/24
D | 10.1.20.0/24
E | 10.8.20.0/24
F | 10.16.20.0/24, 10.32.20.0/24
Virtual machine definition¶
Todo
Guide on how to do VM definition (Pcocc + Puppet) with ready-to-use examples for mandatory services.
Storage definition¶
Todo
Guide on how to design the GlusterFS cluster. May be limited to our way to use gluster (blocks of 3 servers)
Onsite Installation¶
Overview¶
The installation process is roughly the following:
Install the base system on the first management node
Configure this node with all the components needed to deploy the other management nodes
Deploy the other management nodes using the management network
Deploy the Ethernet fabric (administration network)
Install and configure the Ocean components on those nodes using the temporary infrastructure of the first node
Validate the final infrastructure
Redeploy and integrate the first node
Note that most configuration files will already have been generated using confiture, Ocean's configuration generator.
When these steps are all done, diskless or diskful compute nodes can be deployed. Compute node hardware specifics are out of scope of this document, but some advice may be given.
Requirements¶
Management nodes should be configured with their storage system ready to use. The names of those disks (as seen from the OS) will be required by BMGR for the kickstart process.
We advise a minimum of 60 GB of RAID1 storage for the management node system. Data storage will depend on your hardware, but hardware RAID controllers are preferred over software ones.
Note
The top management nodes of our test bed got 2 SATA-DOMs in RAID1 (Intel Rapid Storage) and 10 disks in RAID10 (+2 hot spares), respectively viewed as the Volume0_0 and sdc drives.
The default BIOS configuration will be just fine in most cases; we just need the following features to be activated (or deactivated):
SRIOV support activated
AES-NI support activated (not mandatory but advised)
Legacy boot only
BMC configured with DHCP (if they are cabled inside the cluster, at your discretion if not); see the ipmitool sketch after this list
Energy saving features disabled (Fan profile, CPU profile, Energy efficient features, …)
Boot order: Network, CD/DVD, USB, system hard drives
Network boot devices, this setting might be handled by an option ROM:
Interface cabled onto the bbone network for the top and worker nodes
Interface cabled onto the management network for the other management nodes
Interface cabled onto the administration network
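Regarding the BMC DHCP setting, most BMCs can also be switched to DHCP from a running OS with ipmitool; a minimal sketch, assuming the LAN settings live on channel 1 (check yours with ipmitool lan print):
ipmitool lan print 1
ipmitool lan set 1 ipsrc dhcp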
Moreover, network switches should be in factory configuration.
Note
To factory reset a Cisco switch, in the management shell: erase startup-config and reload
Note
To factory reset an Arista switch: in the Aboot shell (at boot time): mv /mnt/flash/startup-config /mnt/flash/startup-config.old then reboot, or in a privileged shell: erase startup-config and reload
Warning
Some switches keep their ports disabled while the port is coming up (Spanning Tree related). Moreover, DHCP snooping may be enabled by default. To mitigate both issues, set the DHCP server port as a trusted source (ip dhcp snooping trust) and set server-facing ports as edge ports (Cisco's portfast: spanning-tree portfast).
This installation method also requires that Ocean’s repositories are reachable.
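A simple reachability check from the node being installed, reusing the example mirror URL from the media preparation step (adapt the URL and version to your site):
curl -sI http://pkg/mirror/pub/linux/ocean/2.6/ocean/x86_64/repodata/repomd.xml | head -1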
First node deployment¶
System installation¶
With Ocean's installation media burned onto a USB key or DVD, boot the first node. The graphical installer is not available here, as the textual installer is easier to document and to use in a console installation context.
If you have never done it, we advise checking the media using the "Test this media & Install Ocean ${oswanted}" boot option. It might take some time but gives confidence.
When the installer is launched and shows the main menu, proceed with the configuration:
Language setting: English (United States)
Timezone : Europe/Paris
Installation source : Local media (auto detected)
Software selection : Minimal Install
Installation destination: Use the whole system disk with LVM. The partitioning scheme doesn’t really matter here as we’ll reinstall this node soon.
KDump: Enabled
Network configuration: Configure the backbone interface in order to get a remote access. Also configure nameservers and hostnames.
Root password: Configure a temporary root password
User creation: No system user should be created
System pre-configuration¶
If you have anything to do after the installation but before rebooting, you can modify the configuration in anaconda's shell (switch with Alt+Tab). The system is installed within /mnt/sysimage.
For instance, here we disable firewalld and SELinux and change the default SSH port:
systemctl --root /mnt/sysimage disable firewalld
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /mnt/sysimage/etc/selinux/config
sed -i 's/^#Port 22/Port 422/' /mnt/sysimage/etc/ssh/sshd_config
After the installation is complete, make sure the node is booting on the system disks and open a remote shell onto it.
System configuration¶
Anaconda installations enable some unwanted features like SELinux and firewalld. Make them inactive:
systemctl disable --now firewalld
setenforce Permissive
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config
For security purposes, we strongly recommend using port 422 for SSH. To do so:
sed -i 's/^#Port 22/Port 422/' /etc/ssh/sshd_config
systemctl restart sshd
The installation process requires the complete set of Ocean repos; configure them manually on the first node. You should have the following repos configured:
Ocean
Ocean-updates
Greyzone
Greyzone-updates
CentOS
CentOS-updates
CentOS-extras
EPEL
Ocean missing
Gluster
Note
An ocean.repo file may be available at the root of your package repositories.
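If it is not, a minimal .repo stanza could look like the following; the baseurl, gpgcheck and priority values are only illustrative and must match your mirror layout:
# /etc/yum.repos.d/ocean.repo (excerpt)
[ocean]
name=Ocean
baseurl=http://pkg/mirror/pub/linux/ocean/2.6/ocean/x86_64
enabled=1
gpgcheck=0
priority=50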
You may have to disable the included repos that use the official CentOS mirrors to make yum work. Use the --disablerepo option to do so:
yum --disablerepo base,extras,updates makecache
Install some packages, permanently disable the official CentOS repos and synchronize the system with the available packages:
yum -y --disablerepo base,extras,updates install yum-utils yum-plugin-priorities
yum-config-manager --disable base,extras,updates
yum distribution-synchronization
systemctl disable --now NetworkManager
yum remove -y NetworkManager\*
Network configuration¶
This node is connected to all available networks (backbone, management and administration). The backbone was configured by you in Anaconda's text UI. Now configure all the internal networks using the following template and the addressing scheme designed in the off-site step:
# /etc/sysconfig/network-scripts/ifcfg-eno2
# Here eno2 is the management network, and is 10.0.0.1
DEVICE=eno2
BOOTPROTO=static
BROADCAST=10.0.0.255
IPADDR=10.0.0.1
NETMASK=255.255.255.0
NETWORK=10.0.0.0
ONBOOT=yes
You will most probably require some IP routes to be configured; if so, don't forget to set those in /etc/sysconfig/network-scripts/route-INTERFACE_NAME.
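As an illustration, a static route file for the management interface could look like this (networks and gateway address are placeholders):
# /etc/sysconfig/network-scripts/route-eno2
10.3.0.0/24 via 10.0.0.254
10.5.0.0/24 via 10.0.0.254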
Mellanox cards¶
If you're using Mellanox VPI cards for 40G/50G/100G Ethernet links, install the Mellanox OFED and load the drivers:
yum install -y mlnx-ofa_kernel kmod-mlnx-ofa_kernel ocean-fw-mlnx-hca infiniband-diags mstflint kmod-kernel-mft-mlnx unzip
systemctl start openibd
If needed, use the firmware images present in /usr/share/ocean-fw-mlnx-hca/firmware/ and the mstflint tool to burn your firmware:
unzip %FIRMWARE%.bin.zip
mstflint -d 81:00.0 -i %FIRMWARE%.bin burn
Methods to get the card PSID and OPN can be found in /usr/share/ocean-fw-mlnx-hca/release_notes/README.txt.
If needed and using the mstconfig tool, verify and set the link type to Ethernet (a link type of 2 means Ethernet):
mstconfig -d 81:00.0 query | grep LINK_TYPE
mstconfig -y -d 81:00.0 set LINK_TYPE_P1=2
After configuring the Mellanox card for Ethernet, the Flexboot mechanism is activated and may take a long time to initialize 40G links. To deactivate Flexboot:
mstconfig -d 81:00.0 q LEGACY_BOOT_PROTOCOL EXP_ROM_PXE_ENABLE
mstconfig -y -d 81:00.0 set LEGACY_BOOT_PROTOCOL=NONE EXP_ROM_PXE_ENABLE=0
After a reboot, the card should appear as an ensX network device and can be configured like the other interfaces.
MAC addresses gathering¶
If this is not done yet, here is a method to collect MAC addresses on the management network. We assume here that the BMCs auto-configure using DHCP.
Remember that some switches have some requirements (especially spanning-tree related) that have to be met. See Requirements for details.
Using SSH or a console cable, open a shell to the management network switch and display the ARP table.
Here we’re using a USB console cable on a Cisco Catalyst switch:
screen /dev/ttyACM1
Switch> show mac address-table
Using the displayed MAC/Port mapping, match them with the expected cabling (hwdb port list --local esw2), insert them into confiture's data files and re-generate the DHCP configuration.
The shut/no shut trick may be applied on a switch port to force the equipment to relaunch the DHCP phase.
Note
The Catalyst's management interface doesn't do DHCP by default; to activate it, add ip address dhcp to the management interface configuration (fastethernet0 in our case). Get the interface's MAC with show interface fastethernet0.
For the backbone network, you may not have access to the switch. As only 3 nodes will boot over it, a simple tcpdump while booting the node will do the job.
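A possible capture command on the backbone interface while the node PXE-boots (the interface name is only an example):
tcpdump -n -e -i enp130s0f0 port 67 or port 68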
DHCP & Named installation¶
Using the node used for the off-site preparation phase, update confiture’s data with the discovered MACs, re-generate the dhcp configuration and import the dhcpd and named configuration files.
Put them in the right place and start bind and dhcpd.
Note
Some adjustments may have to be made to the generated configuration. As a general rule, don't modify generated files; modify the templates and import the generated files back.
yum install -y dhcp bind bind-utils
systemctl enable --now named dhcpd
Now, configure resolv.conf with this node as the nameserver and verify that all BMCs are now reachable.
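For instance (the -ipmi hostname is an assumption, use the names generated by confiture):
cat > /etc/resolv.conf << EOF
search $(facter domain)
nameserver 127.0.0.1
EOF
ping -c 1 top2-ipmi.$(facter domain)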
LAN MACs gathering¶
Now gather the management node LAN interface MACs. To do so, either:
Make them boot on the network and collect the MACs:
Make sure that the interface is used by the BIOS for PXE (setting in BIOS menu or Option ROM)
Using IPMI set the next boot device to PXE:
ipmitool -U %USER% -P %PASSWORD% -H %HOSTNAME% chassis power off
ipmitool -U %USER% -P %PASSWORD% -H %HOSTNAME% chassis bootdev pxe
ipmitool -U %USER% -P %PASSWORD% -H %HOSTNAME% chassis power on
Collect the MACs on the switch or use tcpdump to capture DHCP requests:
$ screen /dev/ttyACM1
> show mac address-table
Use the BMC web interface to get the system's LAN MAC address.
Use the BIOS or Option ROM information
On SuperMicro hardware, you can get the first LAN MACs by issuing the following IPMI raw command:
ipmitool -U %USER% -P %PASSWORD% -H %HOSTNAME% raw 0x30 0x21 | tail -c 18 | tr ' ' ':'
With those MACs gathered, update confiture’s data and update the DHCP configuration with the freshly generated configuration.
BMGR installation¶
Install the BMGR tool:
yum install -y bmgr
Start and initialize the database:
systemctl enable --now mariadb
mysql << EOF
grant all privileges on bmgr.* to bmgr_user@'localhost' identified by 'bmgr_pass';
create database bmgr;
EOF
FLASK_APP=bmgr.app flask initdb
Add the WSGI entrypoint into Apache’s configuration file:
echo 'WSGIScriptAlias /bmgr "/var/www/bmgr/bmgr.wsgi"' >> /etc/httpd/conf/httpd.conf
systemctl enable --now httpd
Test with the CLI:
bmgr host list
Configuration¶
Create node profiles and assign weight to them:
bmgr profile add -w 0 ocean_mngt
bmgr profile add -w 5 ocean_mngt_top
bmgr profile add -w 10 ocean_mngt_top_1
bmgr profile add -w 5 ocean_mngt_worker
bmgr profile add -w 5 ocean_mngt_islet_worker
Add the cluster nodes and associated profiles into bmgr:
bmgr host add --profiles ocean_mngt,ocean_mngt_top,ocean_mngt_top_1 top[1-3]
bmgr host add --profiles ocean_mngt,ocean_mngt_worker worker[1-3]
bmgr host add --profiles ocean_mngt,ocean_mngt_islet_worker islet[10-11,20-21,...]
Add profile specificities:
# The names of the network interface are given as configuration examples (see section 'Network Configuration')
bmgr profile update ocean_mngt_top_1 -a netdev enp130s0f0 -a ks_drive Volume0_0
bmgr profile update ocean_mngt_worker -a netdev enp3s0f0 -a ks_drive Volume0_0
bmgr profile update ocean_mngt_islet_worker -a netdev eno1 -a ks_drive Volume0_0
bmgr profile update ocean_mngt -a console ttyS1,115200 -a ks_selinux_mode disabled -a ks_firewall_mode disabled -a ks_rootpwd root -a kickstart http://top1-mngt/bmgr/api/v1.0/resources/kickstart/
Note
This strongly depends on your hardware specificities; it may be convenient to create additional profiles.
For example, Cisco Nexus 9K zero-touch provisioning can use bmgr features to autoconfigure the switch. It is up to administrators to design the profile hierarchy and attributes. This is only an example used in our test bed.
Moreover, to help you, bmgr can assign weights to individual profiles, giving them a higher priority.
Deployment server¶
Lorax image¶
The kickstart process will use a custom boot image; this image will be generated with the lorax tool.
Install lorax:
yum install -y lorax
Launch the build process, with the package repo URLs defined in the repo file:
lorax -p Ocean -v ${oswanted} -r 1 $(sed -ne 's/^baseurl=/-s /p' /etc/yum.repos.d/ocean.repo) /var/www/html/boot
Configure bmgr accordingly:
bmgr profile update ocean_mngt -a initrd http://top1-mngt/boot/images/pxeboot/initrd.img -a kernel http://top1-mngt/boot/images/pxeboot/vmlinuz -a install_tree http://top1-mngt/boot
Note
As top nodes may be deployed on a different physical network (backbone instead of internal network), bmgr and other configuration items may have to be duplicated between profiles. For example, for top nodes:
bmgr profile update ocean_mngt_top -a initrd http://top1-bbone/boot/images/pxeboot/initrd.img -a kernel http://top1-bbone/boot/images/pxeboot/vmlinuz -a install_tree http://top1-bbone/boot
Repositories¶
The kickstart process requires local repos; using reposync and createrepo, create a temporary clone of the CentOS repositories:
yum install -y createrepo
reposync -p /var/www/html/boot/packages -r centos-updates -r centos-os -r ocean -r ocean-updates -r ocean-missing -n -m
createrepo -g /var/www/html/boot/packages/centos-os/comps.xml /var/www/html/boot
Warning
Repository names (-r arguments) may differ
Warning
This will use roughly 12 GB in the /var filesystem
Package repository proxy¶
Using Apache, configure a proxy to your package repository:
cat > /etc/httpd/conf.d/mirror.conf << EOF
ProxyPass /mirror http://yumsrv.ccc.cea.fr/
ProxyPassReverse /mirror http://yumsrv.ccc.cea.fr/
EOF
systemctl reload httpd
Warning
Adapt the content of mirror.conf with your repository URL. This should point to some URL where all the repos are available as subdirectories.
Configure bmgr accordingly:
echo ${oswanted}
bmgr profile update ocean_mngt -a ks_repos http://top1-mngt/mirror/ocean/${oswanted}/ocean/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/ocean-updates/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/centos-os/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/centos-update/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/centos-extras/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/epel/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/greyzone/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/greyzone-updates/x86_64
Admin SSH key¶
Generate an SSH key; it will be used after the kickstart process finishes (as no password will be set):
ssh-keygen -b 4096
cp ~/.ssh/id_rsa.pub /var/www/html/authorized_keys
cp ~/.ssh/id_rsa.pub /root/.ssh/authorized_keys
Configure bmgr accordingly:
bmgr profile update ocean_mngt -a ks_authorized_keys_url http://top1-mngt/authorized_keys
TFTP server¶
A TFTP server is required for PXE chainloading. Install a TFTP server:
yum install -y xinetd tftp-server tftp
systemctl enable --now xinetd tftp
And make iPXE network boot loader images available through TFTP:
yum install -y ipxe-bootimgs
ln /usr/share/ipxe/{undionly.kpxe,ipxe.efi} /var/lib/tftpboot/
Warning
Symbolic links are not followed by TFTP server. Only use hardlinks or copy the file you want to serve.
DHCP update¶
Update the DHCP template and confiture’s data with deployment server specifics:
BMGR server URL
TFTP server IP
iPXE ROM name.
DNS IPs
Apply the configuration and restart the dhcp server.
Note
Some equipment may only support an EFI ROM; modify the template to reflect this.
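As a rough sketch of the corresponding dhcpd logic (the BMGR resource URL and TFTP server IP are assumptions; your confiture template should already carry the equivalent):
# dhcpd.conf excerpt: chainload iPXE over TFTP, then hand over to BMGR
next-server 10.0.0.1;
if exists user-class and option user-class = "iPXE" {
    # hypothetical BMGR boot resource URL
    filename "http://top1-mngt/bmgr/api/v1.0/resources/ipxe_boot";
} else {
    filename "undionly.kpxe";
}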
Worker nodes deployment¶
Now that we have everything required to kickstart a node, try to deploy the second node:
Double check that the iPXE script and kickstart file are correct:
bmgr resource render ipxe_deploy_boot top2
bmgr resource render kickstart top2
Note
The kickstart template may be modified; it is located in /etc/bmgr/templates/ks_rhel7.jinja
Configure the BIOS with the settings mentioned above. Make sure that the RAID devices are present and correctly defined in the kickstart file.
Set the next bootdev to PXE:
ipmitool -U %USER% -P %PASS% -H %BMC% chassis bootdev pxe
Enable deploy mode in bmgr:
bmgr alias override -o ipxe_boot ipxe_deploy_boot top2
Start it and monitor the process with a remote console (either SOL or console redirection):
ipmitool -U %USER% -P %PASS% -H %BMC% chassis power on
When the node is fully kickstarted, it will be in a state where:
A minimal set of packages is installed
Proxied repos are configured
The interface used for deployment is configured. The other ones are not.
A ssh daemon is running
Root’s authorized_keys is deployed (with the given URL)
If you have Mellanox cards as multi-gigabit Ethernet cards, you may have to flash and configure them the same way as on the first node, see Mellanox cards.
Make sure the storage you intend to use as a GlusterFS brick is available and ready to use. We strongly recommend setting a filesystem label on the gluster block device. Use xfs_admin -L to set a label on an XFS filesystem.
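For example (device path and label are illustrative):
xfs_admin -L gluster-data /dev/mapper/gluster-brick1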
Ethernet fabric configuration¶
Switch configuration¶
The Ethernet fabric may be configured using 2 different methods:
Manual initial configuration and generated configuration deployment
Zero touch provisioning (ZTP for Arista, POAP for Cisco Nexus)
Zero touch provisioning is very specific to your hardware and may require third-party tools or servers. We will only document the manual process in this general-purpose installation guide.
Note
Cisco POAP is documented in this annex: Cisco PowerOn Auto Provisioning
This process requires a manual step for the initial switch configuration. Connect to each switch using a serial console and set up remote access. This usually includes (see the sketch after this list):
IP address assignment on the management interface
Administrative user creation
Privileged shell (aka enable mode) password setup.
Testing from a remote host
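As an illustration only, on a Cisco IOS-style CLI this could look like the following; addresses and names are placeholders and the exact syntax varies with the switch model:
configure terminal
interface fastethernet0
 ip address 10.0.0.50 255.255.255.0
 no shutdown
exit
username admin privilege 15 secret MY_SECRET
enable secret MY_ENABLE_SECRET
line vty 0 15
 login local
 transport input ssh
end
copy running-config startup-config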
Using the configuration file generated with confiture, test the configuration bits on the real-world switch. If everything seems good, deploy it entirely using the already deployed TFTP or HTTP server.
Note
This step might be iterative: test on the switch, fix the confiture template, redeploy and so on.
Node configuration¶
Todo
Check on the various clusters whether any information would be useful here
Management stack deployment¶
Puppet server installation¶
Now install puppet server and all required components on the first node:
yum install -y puppet4 puppetserver puppet-global puppet-extras puppet-addons git rubygem-r10k rubygem-hocon emacs-nox emacs-yaml-mode vim
Create puppet's required git repos:
git clone --mirror /usr/share/puppet-global /var/lib/puppet-global
git init --bare /var/lib/puppet-cccenv
echo 'ref: refs/heads/production' > /var/lib/puppet-cccenv/HEAD
git init --bare /var/lib/puppet-domain
echo 'ref: refs/heads/production' > /var/lib/puppet-domain/HEAD
Clone them locally:
mkdir /root/puppet
cd /root/puppet
git clone /var/lib/puppet-global global
git clone /var/lib/puppet-cccenv cccenv
git clone /var/lib/puppet-domain domain
And bootstrap cccenv
and domain
repos:
cd /root/puppet/cccenv
mkdir -p modules/empty/manifests files hieradata
touch modules/empty/manifests/empty.pp
git add .
git commit -m 'Initial commit'
git branch -m master production
git push -u origin HEAD:production
cd /root/puppet/domain
mkdir -p files/$(facter domain)/{all-nodes,nodes,hieradata}
ln -sf ../files/$(facter domain)/hieradata hieradata/$(facter domain)
git add .
git commit -m 'Initial commit'
git branch -m master production
git push -u origin HEAD:production
Set the upstream origin in case of puppet-global update:
cd /root/puppet/global
git remote add upstream /usr/share/puppet-global
Set the committer's name and email for each repo:
git --git-dir /root/puppet/global/.git config --local user.name "Super Admin"
git --git-dir /root/puppet/global/.git config --local user.email "super.admin@ocean"
git --git-dir /root/puppet/cccenv/.git config --local user.name "Super Admin"
git --git-dir /root/puppet/cccenv/.git config --local user.email "super.admin@ocean"
git --git-dir /root/puppet/domain/.git config --local user.name "Super Admin"
git --git-dir /root/puppet/domain/.git config --local user.email "super.admin@ocean"
Configure r10k manually; insert the following in /etc/puppetlabs/r10k/r10k.yaml:
---
:cachedir: /var/cache/r10k
:sources:
:global:
remote: /var/lib/puppet-global
basedir: /etc/puppetlabs/code/environments
:deploy:
purge_whitelist: [ ".resource_types/*", ".resource_types/**/*" ]
Deploy the repos with r10k:
r10k deploy environment -pv
Configure the master's ENC in /etc/puppetlabs/puppet/puppet.conf:
[master]
node_terminus = exec
external_nodes = /sbin/puppet-external
Start the puppetserver:
systemctl enable --now puppetserver
Set the current node (the first node) profile in /etc/puppet/puppet-groups.yaml:
environments:
production: 'top1'
roles:
puppetserver: 'top1'
Test and then apply this profile:
puppet-check -v --server $(facter fqdn)
puppet-apply -v --server $(facter fqdn)
Note
This will manage all the files and components required to launch a puppet server. The only unmanaged things are the 3 repos in /var/lib/.
Note
Some warnings about missing augeas lenses may appear in the puppet-check output; you can safely ignore them:
[...]
Augeas didn't load ... with Trapperkeep.lns
[...]
You now have a working puppet server.
Profile setup¶
Ocean includes a set of basic profiles that configure the management stack. Many of them require configuration. The available profiles are present in the hieradata/global folder of the global repo.
Warning
The following configuration files are only examples; adapt them to your deployment specificities.
ClusterShell groups configuration¶
To have a convenient way to define node roles, define the ClusterShell groups configuration this way:
sed -i -e 's/^default:.*/default: cluster/' /etc/clustershell/groups.conf
cat >/etc/clustershell/groups.d/cluster.yaml <<EOF
cluster:
top: 'top[1-3]'
worker: 'top[1-3],worker[1-3]'
i_worker: 'islet[10-11,20-21]'
puppetserver: 'top1'
etcd: '@top,@worker,@i_worker'
etcd_client: '@i_worker'
fleet: '@top,@worker,@i_worker'
gluster_server: '@top,@worker'
gluster_client: '@i_worker'
pcocc_standalone: '@top,@worker,@i_worker'
pcocc_standalone_top: '@top'
mngt_top: '@top'
mngt_common: '@top,@worker,@i_worker'
all: '@top,@worker,@i_worker'
EOF
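You can check that the groups resolve as expected with ClusterShell's nodeset command:
nodeset -f @gluster_server
nodeset -f @all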
Profile dispatch¶
In the puppet-groups.yaml file, dispatch the profiles on the management nodes:
environments:
production: '@all'
roles:
puppetserver: '@puppetserver'
etcd: '@etcd'
etcd_client: '@etcd_client'
fleet: '@fleet'
gluster_server: '@gluster_server'
gluster_client: '@gluster_client'
pcocc_standalone: '@pcocc_standalone'
pcocc_standalone_top: '@pcocc_standalone_top'
90_mngt_top: '@mngt_top'
91_mngt_common: '@mngt_common'
99-common: '@all'
Common configuration and resources¶
Most profiles require a set of basic configuration and resources, such as network configuration. To do so, create common profiles in the domain repo and configure them using the following content:
# hieradata/91_mngt_common.yaml
resources:
net::ifcfg:
"%{hiera('adm_interface')}":
mode: 'bridge'
bridge: 'bradm'
mtu: 9000
bradm:
mode: 'fromdns'
type: 'Bridge'
dnssuffix: '-adm'
mask: '255.255.255.0'
mtu: 9000
"bradm:data":
mode: 'fromdns'
dnssuffix: '-data'
mask: '255.255.255.0'
bridge: 'brmngt'
"%{hiera('mngt_interface')}":
mode: 'bridge'
bridge: 'brmngt'
brmngt:
mode: 'fromdns'
type: 'Bridge'
dnssuffix: '-mngt'
mask: '255.255.255.0'
#net::route:
# bradm:
# xtype: |
# content:10.3.0.0/24 via ADM_GATEWAY_IP
# 10.5.0.0/24 via ADM_GATEWAY_IP'
# hieradata/90_mngt_top.yaml
resources:
net::ifcfg:
"%{hiera('bbone_interface')}":
mode: 'bridge'
bridge: 'brbone'
brbone:
mode: 'fromdns'
type: 'Bridge'
dnssuffix: '-bbone'
mask: '255.255.255.0'
#net::route:
# brbone:
# xtype: 'content:default via BBONE_GATEWAY_IP'
Warning
You will most probably require some routes to be set on the backbone interface; to do so, instantiate a net::route as in the comment of the 90_mngt_top.yaml file.
Note
Some node-specific variables must be included there; create a node hiera data file (like hieradata/top1.yaml for top1) in the domain repo. The following example specifies the interface names and the fleet role for top1.
# hieradata/top1.yaml
adm_interface: 'ens1'
bone_interface: 'enp130s0f0'
mngt_interface: 'eno2'
fleet_role: 'top'
The same kind of variables can be set using profile-specific variables.
To create a new profile, create a hieradata file in the domain or cccenv repo and assign it to nodes in /etc/puppet/puppet-groups.yaml.
Commit your change:
git add hieradata
git commit -m "Common configuration"
Time synchronization¶
Because the management daemons require low clock skew, the management nodes have to be time-synchronized.
Configure the chrony daemon on the first node so that the other nodes can synchronize with it:
server <upstream ntp server> iburst
# Record the rate at which the system clock gains/losses time.
driftfile /var/lib/chrony/drift
# Enable kernel RTC synchronization.
rtcsync
# In first three updates step the system clock instead of slew
# if the adjustment is larger than 10 seconds.
makestep 10 3
# Allow NTP client access from local network.
allow <your network>
logdir /var/log/chrony
acquisitionport 123
Do the first sync with ntpdate:
clush -bw top[2-3],worker[1-3] yum install -y ntpdate chrony
clush -bw top[2-3],worker[1-3] ntpdate top1-mngt.$(facter domain)
Configure chrony on the nodes:
cat >/tmp/chrony.conf <<EOF
server top1-mngt.$(facter domain) iburst
driftfile /var/lib/chrony/drift
rtcsync
makestep 10 3
logdir /var/log/chrony
EOF
clush -w top[2-3],worker[1-3] --copy /tmp/chrony.conf --dest /etc/chrony.conf
clush -bw top[2-3],worker[1-3] systemctl enable --now chronyd
Etcd configuration¶
In the hieradata folder of the domain repo, configure the etcd profile accordingly:
# hieradata/etcd.yaml
etcd::initial_cluster:
- 'top1=https://top1.%{::domain}:2380'
- 'top2=https://top2.%{::domain}:2380'
- 'top3=https://top3.%{::domain}:2380'
- 'worker1=https://worker1.%{::domain}:2380'
- 'worker2=https://worker2.%{::domain}:2380'
- 'worker3=https://worker3.%{::domain}:2380'
Commit your change:
git add hieradata/etcd.yaml
git commit -m "Initial etcd configuration"
Fleet configuration¶
Same for fleet:
# hieradata/fleet.yaml
fleet::server::settings:
etcd_servers: "[\"https://top1.%{::domain}:2379\", \"https://top2.%{::domain}:2379\", \"https://top3.%{::domain}:2379\", \"https://worker1.%{::domain}:2379\", \"https://worker2.%{::domain}:2379\", \"https://worker3.%{::domain}:2379\",]"
etcd_username: 'root'
etcd_password: 'password'
public_ip: "%{::ipaddress_bradm}"
metadata: "'hostname=%{::hostname},role=%{hiera('fleet_role')}'"
enable_grpc: "true"
Commit your change:
git add hieradata
git commit -m "Initial fleet configuration"
Gluster configuration¶
The Gluster profile requires some configuration and already-mounted bricks. In the hieradata folder of the domain repo, configure the gluster_server profile and its requirements:
# hieradata/gluster_server.yaml
resources:
gluster::mount:
'/volspoms1':
volume: 'top1-data,top2-data,top3-data,worker1-data,worker2-data,worker3-data:/volspoms1'
'/volspoms2':
volume: 'top1-data,top2-data,top3-data,worker1-data,worker2-data,worker3-data:/volspoms2'
file:
'/gluster':
ensure: 'directory'
owner: 'root'
group: 'root'
mode: '0755'
tag: 'gluster'
'/gluster/brick1':
ensure: 'directory'
owner: 'root'
group: 'root'
mode: '0755'
tag: 'gluster'
'/gluster/brick2':
ensure: 'directory'
owner: 'root'
group: 'root'
mode: '0755'
tag: 'gluster'
'/gluster/brick3':
ensure: 'directory'
owner: 'root'
group: 'root'
mode: '0755'
tag: 'gluster'
'/gluster/brick4':
ensure: 'directory'
owner: 'root'
group: 'root'
mode: '0755'
tag: 'gluster'
'/volspoms1':
ensure: 'directory'
owner: 'root'
group: 'root'
mode: '0755'
tag: 'gluster'
'/volspoms2':
ensure: 'directory'
owner: 'root'
group: 'root'
mode: '0755'
tag: 'gluster'
mount:
'/gluster/brick1':
ensure: 'mounted'
fstype: 'xfs'
device: "/dev/mapper/gluster-brick1"
options: 'defaults,noatime,auto'
dump: '1'
pass: '2'
tag: 'gluster'
require: 'File[/gluster/brick1]'
'/gluster/brick2':
ensure: 'mounted'
fstype: 'xfs'
device: "/dev/mapper/gluster-brick2"
options: 'defaults,noatime,auto'
dump: '1'
pass: '2'
tag: 'gluster'
require: 'File[/gluster/brick2]'
'/gluster/brick3':
ensure: 'mounted'
fstype: 'xfs'
device: "/dev/mapper/gluster-brick3"
options: 'defaults,noatime,auto'
dump: '1'
pass: '2'
tag: 'gluster'
require: 'File[/gluster/brick3]'
'/gluster/brick4':
ensure: 'mounted'
fstype: 'xfs'
device: "/dev/mapper/gluster-brick4"
options: 'defaults,noatime,auto'
dump: '1'
pass: '2'
tag: 'gluster'
require: 'File[/gluster/brick4]'
gluster::peer:
"top1-data.%{::domain}":
fqdn: "top1.%{::domain}"
pool: 'production'
require: 'Class[Gluster::Service]'
"top2-data.%{::domain}":
fqdn: "top2.%{::domain}"
pool: 'production'
require: 'Class[Gluster::Service]'
"top3-data.%{::domain}":
fqdn: "top3.%{::domain}"
pool: 'production'
require: 'Class[Gluster::Service]'
"worker1-data.%{::domain}":
fqdn: "worker1.%{::domain}"
pool: 'production'
require: 'Class[Gluster::Service]'
"worker2-data.%{::domain}":
fqdn: "worker2.%{::domain}"
pool: 'production'
require: 'Class[Gluster::Service]'
"worker3-data.%{::domain}":
fqdn: "worker3.%{::domain}"
pool: 'production'
require: 'Class[Gluster::Service]'
gluster::volume:
'volspoms1':
replica: 3
arbiter: 1
options:
- 'features.shard: true'
- 'features.shard-block-size: 64MB'
- 'nfs.disable: true'
# Virt group
- 'performance.quick-read: false'
- 'performance.read-ahead: false'
- 'performance.io-cache: false'
- 'performance.low-prio-threads: 32'
- 'network.remote-dio: enable'
- 'cluster.eager-lock: enable'
- 'cluster.quorum-type: auto'
- 'cluster.server-quorum-type: server'
- 'cluster.data-self-heal-algorithm: full'
- 'cluster.locking-scheme: granular'
- 'cluster.shd-max-threads: 8'
- 'cluster.shd-wait-qlength: 10000'
- 'user.cifs: false'
require:
- "Gluster::Peer[top1-data.%{::domain}]"
- "Gluster::Peer[top2-data.%{::domain}]"
- "Gluster::Peer[top3-data.%{::domain}]"
- "Gluster::Peer[worker1-data.%{::domain}]"
- "Gluster::Peer[worker2-data.%{::domain}]"
- "Gluster::Peer[worker3-data.%{::domain}]"
- 'Mount[/gluster/brick1]'
- 'Mount[/gluster/brick2]'
- 'Mount[/gluster/brick3]'
bricks:
# 1st node - 2nd node - 3nd node
# Data - Data - Arbiter
- "top1-data.%{::domain}:/gluster/brick1/data"
- "top2-data.%{::domain}:/gluster/brick1/data"
- "top3-data.%{::domain}:/gluster/brick1/data"
# Data - Arbiter - Data
- "top3-data.%{::domain}:/gluster/brick2/data"
- "top1-data.%{::domain}:/gluster/brick2/data"
- "top2-data.%{::domain}:/gluster/brick2/data"
# Arbiter - Data - Data
- "top2-data.%{::domain}:/gluster/brick3/data"
- "top3-data.%{::domain}:/gluster/brick3/data"
- "top1-data.%{::domain}:/gluster/brick3/data"
# Data - Data - Arbiter
- "worker1-data.%{::domain}:/gluster/brick1/data"
- "worker2-data.%{::domain}:/gluster/brick1/data"
- "worker3-data.%{::domain}:/gluster/brick1/data"
# Data - Arbiter - Data
- "worker3-data.%{::domain}:/gluster/brick2/data"
- "worker1-data.%{::domain}:/gluster/brick2/data"
- "worker2-data.%{::domain}:/gluster/brick2/data"
# Arbiter - Data - Data
- "worker2-data.%{::domain}:/gluster/brick3/data"
- "worker3-data.%{::domain}:/gluster/brick3/data"
- "worker1-data.%{::domain}:/gluster/brick3/data"
'volspoms2':
replica: 3
options:
- 'nfs.disable: true'
require:
- "Gluster::Peer[top1-data.%{::domain}]"
- "Gluster::Peer[top2-data.%{::domain}]"
- "Gluster::Peer[top3-data.%{::domain}]"
- "Gluster::Peer[worker1-data.%{::domain}]"
- "Gluster::Peer[worker2-data.%{::domain}]"
- "Gluster::Peer[worker3-data.%{::domain}]"
- 'Mount[/gluster/brick4]'
bricks:
- "top1-data.%{::domain}:/gluster/brick4/data"
- "top2-data.%{::domain}:/gluster/brick4/data"
- "top3-data.%{::domain}:/gluster/brick4/data"
- "worker1-data.%{::domain}:/gluster/brick4/data"
- "worker2-data.%{::domain}:/gluster/brick4/data"
- "worker3-data.%{::domain}:/gluster/brick4/data"
You need to prepare the bricks on each node:
pvcreate --dataalignment 768k /dev/sdb
vgcreate --physicalextentsize 768K gluster /dev/sdb
lvcreate --thin gluster/thin_pool --extents 100%FREE --chunksize 256k --poolmetadatasize 16G --zero n
lvcreate --thin --name brick1 --virtualsize 1.25t gluster/thin_pool
lvcreate --thin --name brick2 --virtualsize 1.25t gluster/thin_pool
lvcreate --thin --name brick3 --virtualsize 1.25t gluster/thin_pool
lvcreate --thin --name brick4 --virtualsize 2.5t gluster/thin_pool
mkfs.xfs -f -d su=256k,sw=1 -s size=512 /dev/mapper/gluster-brick1
mkfs.xfs -f -d su=256k,sw=1 -s size=512 /dev/mapper/gluster-brick2
mkfs.xfs -f -d su=256k,sw=1 -s size=512 /dev/mapper/gluster-brick3
mkfs.xfs -f -d su=256k,sw=1 -s size=512 /dev/mapper/gluster-brick4
Note
Here we assumed 6 × 960GB drives configured in a RAID 10. Have a look at the original RedHat procedure here: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html/administration_guide/brick_configuration
Also add the gluster_client profile configuration:
# hieradata/gluster_client.yaml
resources:
file:
'/volspoms1':
ensure: 'directory'
owner: 'root'
group: 'root'
mode: '0755'
tag: 'gluster'
'/volspoms2':
ensure: 'directory'
owner: 'root'
group: 'root'
mode: '0755'
tag: 'gluster'
gluster::mount:
'/volspoms1':
volume: 'top1-data,top2-data,top3-data,worker1-data,worker2-data,worker3-data:/volspoms1'
'/volspoms2':
volume: 'top1-data,top2-data,top3-data,worker1-data,worker2-data,worker3-data:/volspoms2'
Commit your change:
git add hieradata
git commit -m "Initial gluster configuration"
Pcocc configuration¶
The Pcocc profile requires some configuration. In the hieradata folder of the domain repo, configure the pcocc_standalone and pcocc_standalone_top profiles and their requirements:
# hieradata/pcocc_standalone.yaml
pcocc::config::batch::etcd_servers:
- "top1.%{::domain}"
- "top2.%{::domain}"
- "top3.%{::domain}"
- "worker1.%{::domain}"
- "worker2.%{::domain}"
- "worker3.%{::domain}"
pcocc::config_etcd_pwd_xtype: 'content:password'
pcocc::config::repos::repos:
- name: volspoms
path: /volspoms1/pcocc
pcocc::config::networks:
adm:
type: bridged
host_bridge: "bradm"
tap_prefix: "admtap"
mtu: 9000
mngt:
type: bridged
host_bridge: "brmngt"
tap_prefix: "mntap"
mtu: 1500
pcocc::config::templates:
generic:
resource_set: 'default'
user_data: '/etc/pcocc/cloudinit/generic.yaml'
image: 'volspoms:cloud-ocean2.6'
# hieradata/pcocc_standalone_top.yaml
pcocc::config::networks:
bone:
type: bridged
host_bridge: "brbone"
tap_prefix: "bbtap"
mtu: 1500
Commit your change:
git add hieradata
git commit -m "Initial Pcocc configuration"
Profile application¶
Push and deploy all your changes to the puppetserver:
git --git-dir /root/puppet/global/.git push
git --git-dir /root/puppet/domain/.git push
git --git-dir /root/puppet/cccenv/.git push
r10k deploy environment -pv
Puppet agents bootstrap¶
Install puppet4 and puppet-addons on the management nodes:
clush -bw top[2-3],worker[1-3] yum install -y puppet4 puppet-addons
Bootstrap the SSL certificates with a puppet-check:
clush -bw top[2-3],worker[1-3] puppet-check --tags net --server top1.$(facter domain)
clush -bw top[2-3],worker[1-3] -R exec puppet cert sign %h.$(facter domain)
Network configuration¶
Warning
Interface names will most probably not match the ones you will have. Please make sure you reset the right interface. Here interfaces are named as follows:
Host | Backbone | Management | Administration
top1 | enp130s0f0 | eno2 | ens1
top2 | enp130s0f0 | eno2 | ens1
top3 | enp3s0f0 | enp3s0f1 | ens6f0
worker1 | - | enp3s0f0 | ens6f0
worker2 | - | enp3s0f0 | ens6f0
worker3 | - | enp3s0f0 | ens6f0
Network configuration (and bridges) requires a bit more work to apply. Here, we'll apply the configuration files and then restart the interfaces.
First, apply network configuration files:
clush -bw top[1-3],worker[1-3] puppet-apply --tags net --server top1.$(facter domain)
Double-check that all nodes have their network configuration files correctly set:
clush -bw top[1-3],worker[1-3] 'more /etc/sysconfig/network-scripts/ifcfg-* | cat'
This will only change files, no restart is done. Make sure you have connectivity on all networks:
clush -bw top[1-3],worker[1-3] uname
clush -bw top[1-3]-bbone uname
clush -bw worker[1-3]-mngt uname
Stop the management interfaces:
clush -bw top[1-2] ifdown eno2
clush -bw top3 ifdown enp3s0f1
clush -bw worker[1-3] ifdown enp3s0f0
Start the management bridge:
clush -bw top[1-3],worker[1-3] ifup brmngt
Start the management interfaces:
clush -bw top[1-2] ifup eno2
clush -bw top3 ifup enp3s0f1
clush -bw worker[1-3] ifup enp3s0f0
And now make sure that management nodes are still reachable through all networks:
clush -bw top[1-3],worker[1-3] uname
clush -bw top[1-3]-bbone,worker[1-3]-mngt uname
Using the same technique, restart administration network to make sure all interfaces are bridged correctly:
# Stop administration interfaces
clush -bw top[1-2] ifdown ens1
clush -bw top3,worker[1-3] ifdown ens6f0
# Start administration bridges
clush -bw top[1-3],worker[1-3]-mngt ifup bradm
# Start administration interfaces
clush -bw top[1-2] ifup ens1
clush -bw top3,worker[1-3]-mngt ifup ens6f0
Make sure once more that everything went fine:
clush -bw top[1-3],worker[1-3] uname
clush -bw top[1-3],worker[1-3]-mngt uname
Using the same technique, restart backbone network to make sure all interfaces are bridged correctly:
# Stop backbone interfaces
clush -bw top2 ifdown enp130s0f0
clush -bw top3 ifdown enp3s0f0
# Start backbone bridges
clush -bw top[2-3] ifup brbone
# Start backbone interfaces
clush -bw top2 ifup enp130s0f0
clush -bw top3 ifup enp3s0f0
By using top2 as a relay, do the same thing on the first node (make sure you don't disconnect yourself!):
ifdown enp130s0f0; ifup brbone; ifup enp130s0f0
Gluster cluster bootstrap¶
After making sure that all nodes can reach each other using the -data suffix, apply the gluster_server profile:
clush -bw top[1-3],worker[1-3] puppet-apply-changes --tags gluster --server top1.$(facter domain)
This should have installed the gluster packages and created and mounted the /gluster directories that will contain all the gluster bricks.
Then apply it a second time on only one node; this step will set up the gluster peers and volumes:
puppet-apply-changes --tags gluster --server top1.$(facter domain)
And launch it a third time to mount it on all nodes:
clush -bw top[1-3],worker[1-3] puppet-apply-changes --tags gluster --server top1.$(facter domain)
Note
In case of failures, launch the puppet run with puppet-apply -v instead of puppet-apply-changes and investigate configuration errors.
Now, Gluster should be up and running:
clush -bw top[1-3],worker[1-3] df -ht fuse.glusterfs
Apply quota to the 2 volumes:
gluster volume quota volspoms1 enable
# For a 5.5TB quota (bash needs integers, so 11/2)
gluster volume quota volspoms1 limit-usage / $((1024*1024*1024*1024*11/2)) 95
gluster volume quota volspoms2 enable
# For a 1TB quota
gluster volume quota volspoms2 limit-usage / $((1024*1024*1024*1024))
In order to have the quotas working correctly you need at least one file (even an empty one) on each gluster filesystem.
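For example, simply create an empty file on each volume (the file name is arbitrary):
touch /volspoms1/.quota_init /volspoms2/.quota_init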
Now the quotas should be ok:
gluster volume quota volspoms1 list
gluster volume quota volspoms2 list
Etcd cluster bootstrap¶
Apply the profile on all nodes:
clush -bw top[1-3],worker[1-3] puppet-apply-changes --tags etcd --server top1.$(facter domain)
Check that the cluster bootstrapped correctly:
etcdctl -C https://$(facter fqdn):2379 cluster-health
Activate authentication and remove default roles:
# etcdctl -C https://$(facter fqdn):2379 user add root
New password:
User root created
# etcdctl -C https://$(facter fqdn):2379 auth enable
Authentication Enabled
# etcdctl -C https://$(facter fqdn):2379 -u root role remove guest
Password:
Role guest removed
Check that user root is the only user left:
etcdctl -C https://$(facter fqdn):2379 -u root user list
Password:
root
Check that everything is still correct with:
etcdctl -C https://$(facter fqdn):2379 -u root cluster-health
Fleet cluster bootstrap¶
Apply the profile on all nodes:
clush -bw top[1-3],worker[1-3] puppet-apply-changes --tags fleet --server top1.$(facter domain)
Check that the cluster bootstrapped correctly:
fleetctl list-machines
Pcocc configuration¶
Note
PCOCC documentation and tutorials can be found here: https://pcocc.readthedocs.io/en/latest/index.html
Apply the profile on all nodes:
clush -bw top[1-3],worker[1-3] puppet-apply-changes --tags pcocc --server top1.$(facter domain)
Check that pcocc is configured correctly:
clush -bw top[1-3],worker[1-3] pcocc internal setup init
clush -bw top[1-3],worker[1-3] pcocc internal setup cleanup
Service VM installation¶
Most (if not all) of Ocean's services are hosted within service VMs. Each of them hosts some roles which are configured using puppet profiles. As an example, we consider the following VMs:
madmin
Being the central administration node, no real service is hosted here
VM bootstrap¶
Adding a new VM in an Ocean cluster is a three step process:
Allocate and configure a DHCP-provided IP
Bootstrap and configure the VM
Launch it into the highly-available launch system
VM allocation¶
Launch and open a shell into the site-prep virtual machine. Inside the confiture configuration directory, allocate new IPs in hiera/addresses.yaml and MACs in hiera/hwaddrs.yaml.
Tip
MAC allocations may use the locally-managed MAC range.
We advise using 52:54:XX:00:YY
on each network, i.e. 52:54:00:00:00
for the first VM on one network and 52:54:01:00:00
for the first VM on the other network.
clush
may ease this step:
clush -w vm[0-20] -R exec 'printf "52:54:01:00:%%02x" %n'
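For illustration only, entries in these two files could look like the following; the key names and layout here are assumptions, not the actual confiture schema, so mimic the entries already present in the repository:
# hiera/hwaddrs.yaml (hypothetical layout)
madmin:
  adm: '52:54:00:00:00'
  bbone: '52:54:01:00:00'
# hiera/addresses.yaml (hypothetical layout)
madmin:
  adm: '10.1.0.10'
  bbone: '10.2.0.10'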
Generate the DNS and DHCP configuration:
confiture dhcp
confiture dns
Deploy the generated files on the first node.
VM Configuration¶
Insert the VM definition into domain's hieradata using the following template:
# .ssh/authorized_keys content
vm_authorized_keys:
- "ssh-rsa %% PUB KEY CONTENT %%"
vm_yum_repos:
- 'name': 'centos-os'
'url': 'http://ocean'
'priority': '60'
- 'name': 'centos-update'
'url': 'http://ocean'
'priority': '50'
resources:
pcocc::standalone::vm:
VM_NAME:
reference_image_name: 'volspoms:cloud-ocean2.6'
fleet: true
resource_set: 'service'
cpu_count: 4
mem_per_cpu: 1000
ethernet_nics:
adm: 'ALLOCATED_MAC_ADM'
bbone: 'ALLOCATED_MAC_BBONE'
ssh_authorized_keys: "%{alias('vm_authorized_keys')}"
yum_repos: "%{alias('vm_yum_repos')}"
persistent_drive_dir: '/volspoms1/pcocc/persistent_drives'
persistent_drives:
- '/volspoms1/pcocc/persistent_drives/VM_NAME.qcow2'
Generate the VM's puppet SSL certificate on the puppet master (the first node so far) and add it into the domain repo:
#default VM_NAME="admin1 admin2 batch1 batch2 db1 i0con1 i0conf2 i0log1 infra1 infra2 lb1 lb2 monitor1 ns1 ns2 ns3 nsrelay1 webrelay1 webrelay2"
puppet cert generate $VM_NAME.$(facter domain)
mkdir -p /root/puppet/domain/nodes/$VM_NAME/etc/puppetlabs/puppet/ssl/{certs,private_keys}
cp /etc/puppetlabs/puppet/ssl/certs/$VM_NAME.pem /root/puppet/domain/nodes/$VM_NAME/etc/puppetlabs/puppet/ssl/certs/
cp /etc/puppetlabs/puppet/ssl/private_keys/$VM_NAME.pem /root/puppet/domain/nodes/$VM_NAME/etc/puppetlabs/puppet/ssl/private_keys/
Generate the SSH host keys and add them into the domain repo:
mkdir -p /root/puppet/domain/nodes/${VM_NAME}/etc/ssh
ssh-keygen -t dsa -P "" -f /root/puppet/domain/nodes/${VM_NAME}/etc/ssh/ssh_host_dsa_key
ssh-keygen -t rsa -P "" -f /root/puppet/domain/nodes/${VM_NAME}/etc/ssh/ssh_host_rsa_key
ssh-keygen -t ecdsa -P "" -f /root/puppet/domain/nodes/${VM_NAME}/etc/ssh/ssh_host_ecdsa_key
ssh-keygen -t ed25519 -P "" -f /root/puppet/domain/nodes/${VM_NAME}/etc/ssh/ssh_host_ed25519_key
#
cd /root/puppet/domain/nodes/
git add -A .
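When preparing several VMs at once, the certificate and host-key generation above can be wrapped in a small loop (a sketch built from the commands above; adjust the VM list to your site):
for VM_NAME in admin1 admin2 db1; do
    # Puppet SSL certificate for the VM
    puppet cert generate $VM_NAME.$(facter domain)
    mkdir -p /root/puppet/domain/nodes/$VM_NAME/etc/puppetlabs/puppet/ssl/{certs,private_keys}
    cp /etc/puppetlabs/puppet/ssl/certs/$VM_NAME.pem /root/puppet/domain/nodes/$VM_NAME/etc/puppetlabs/puppet/ssl/certs/
    cp /etc/puppetlabs/puppet/ssl/private_keys/$VM_NAME.pem /root/puppet/domain/nodes/$VM_NAME/etc/puppetlabs/puppet/ssl/private_keys/
    # SSH host keys for the VM
    mkdir -p /root/puppet/domain/nodes/$VM_NAME/etc/ssh
    for keytype in rsa dsa ecdsa ed25519; do
        ssh-keygen -t $keytype -P "" -f /root/puppet/domain/nodes/$VM_NAME/etc/ssh/ssh_host_${keytype}_key
    done
done
The git add and git commit steps are unchanged.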
Assign roles to the newly created VM in the ENC configuration file:
# /etc/puppet/puppet-groups.yaml
environments:
production: 'VM_NAME'
roles:
90-common: 'VM_NAME'
Commit and apply the configuration:
git commit
git --git-dir /root/puppet/domain/.git push
r10k deploy environment -pv
puppet-check -v --server $(facter fqdn)
puppet-apply -v --server $(facter fqdn)
To launch the VM manually, link fleet's unit file into the systemd search path and start the service:
systemctl link /etc/fleet/units/pcocc-vm-VM_NAME.service
systemctl start pcocc-vm-VM_NAME.service
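Optionally, check that the unit started correctly before attaching to the console:
systemctl status pcocc-vm-VM_NAME.service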
You can follow the bootstrap process using the pcocc console command:
pcocc console -J VM_NAME vm0
First node reinstallation¶
Todo
Stack validation
First node reinstallation/reintegration
Etcd reintegration¶
It may happen that gluster or etcd loses its cluster membership, leaving the node top1 on its own.
In that case etcd does not start, and the etcd log contains a line such as member cd99d5c27dc7998b has already been bootstrapped.
One way to recover from this situation is to remove the old member top1 from the cluster by doing the following:
Get the id of the member to be removed by listing the members of the cluster.
Remove the etcd data directory on the reinstalled top1 machine.
Change the value of ETCD_INITIAL_CLUSTER_STATE from new to existing in /etc/etcd/etcd.conf (still on the top1 machine).
Add the new member to the cluster using the same functional endpoint.
Start the etcd service on top1.
You will need the etcd root password (which should be stored in puppet) when running some of the commands below. All these commands are to be executed on the top1 machine:
# etcdctl --endpoint https://top2.mg1.cloud.domain.fr:2379 member list
# etcdctl --endpoint https://top2.mg1.cloud.domain.fr:2379 -u root member remove cd99d5c27dc7998b
# mv /var/lib/etcd/top1.etcd /root/
# sed -i -e 's/ETCD_INITIAL_CLUSTER_STATE="new"/ETCD_INITIAL_CLUSTER_STATE="existing"/' /etc/etcd/etcd.conf
# etcdctl --endpoint https://top2.mg1.cloud.domain.fr:2379 -u root member add top1 https://top1.mg1.cloud.domain.fr:2380
# systemctl start etcd
One may check that the cluster is healthy:
# etcdctl --endpoint https://top2.mg1.cloud.domain.fr:2379 cluster-health
member 3910b3c7ca5407b is healthy: got healthy result from https://top1.mg1.cloud.domain.fr:2379
member 5c4493d902893be8 is healthy: got healthy result from https://top3.mg1.cloud.domain.fr:2379
member 647b167693a963df is healthy: got healthy result from https://top2.mg1.cloud.domain.fr:2379
cluster is healthy
Gluster reintegration¶
While the bricks should not have been impacted by the top1 node reinstallation, recovering gluster's cluster consists in getting the old gluster UUID from the cluster and re-adding it to the new top1 machine. First we need to know the old UUID of top1 by running the following command on the remaining gluster nodes:
# gluster peer status
Number of Peers: 2
Hostname: top3-data.mg1.cloud.domain.fr
Uuid: 9acf71b8-263b-4056-8a2a-d4051911487c
State: Peer in Cluster (Connected)
Other names:
top3-data
Hostname: top1-data.mg1.cloud.domain.fr
Uuid: 5dd0b3f7-ab7e-4d9a-98f2-05395eec7891
State: Peer Rejected (Connected)
Now, on top1, stop the glusterd daemon and edit /var/lib/glusterd/glusterd.info to replace the UUID there with the old one:
# systemctl stop glusterd
# sed -i -e 's/UUID=.*/UUID=5dd0b3f7-ab7e-4d9a-98f2-05395eec7891/' /var/lib/glusterd/glusterd.info
Retrieve the peer information files from the other nodes and delete the one that corresponds to top1:
# mkdir /root/peers
# scp top2:/var/lib/glusterd/peers/* /root/peers/
# scp top3:/var/lib/glusterd/peers/* /root/peers/
# rm /root/peers/5dd0b3f7-ab7e-4d9a-98f2-05395eec7891
Then copy these peer information files into /var/lib/glusterd/peers/:
# cp /root/peers/* /var/lib/glusterd/peers/
Restart gluster daemon:
# systemctl start glusterd
Then start healing the volumes:
# gluster volume heal volspoms1 full
# gluster volume heal volspoms2 full
To see the self-heal status of the volumes, you may execute the following commands:
# gluster volume heal volspoms1 info
# gluster volume heal volspoms2 info
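To follow the healing progress without re-running these commands by hand, something like the following may help (the interval is arbitrary):
# watch -n 60 'gluster volume heal volspoms1 info | grep "Number of entries"'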
When “Number of entries:” is 0 on every brick, everything is OK.
Todo
Propose a VM installation order?
Do not forget to remove top1 as the reference server in the ntp configuration, if that had been done
Todo
Reintegrate what follows into a more general section about Puppet.
Main classes affecting the different VMs
admin1 admin2 batch1 batch2 db1 i0con1 i0conf2 i0log1 infra1 infra2 lb1 lb2 monitor1 ns1 ns2 ns3 nsrelay1 webrelay1 webrelay2
ns:
dns_client
dns_server
gluster_client
ldap_server
log_client
log_client_islet0
mail_client
monitored_server
ntp_client
nsrelay:
dns_client
gluster_client
ldap_fuse
log_client
mail_client
monitored_server
ntp_client
webrelay:
dns_client
gluster_client
log_client
log_client_islet0
mail_client
monitored_server
ntp_client
webrelay
ilog:
conman_server
conman_server_islet0
dns_client
gluster_client
log_client
log_client_islet0
mail_client
monitored_server
ntp_client
infra:
dhcp_server
dns_client
gluster_client
log_client
log_client_islet0
mail_client
monitored_server
ntp_client
tftp_server
db:
auks_server
clary_server
dns_client
gluster_client
ldap_client
log_client
log_client_islet0
mail_client
monitored_server
ntp_client
slurm_db
lb:
dns_client
dns_server
gluster_client
haproxy_server
haproxy_server_http
haproxy_server_ldap
haproxy_server_puppet
log_client
log_client_islet0
mail_client
monitored_server
ntp_client
i0conf:
bmgr
bmgr_server
dns_cache
dns_client
dns_server
gluster_client
log_client
log_client_islet0
mail_client
monitored_server
ntp_client
puppetserver
puppetserver_islet0
racktables
webserver
webserver_islet0
Compute node installation¶
Todo
Flash compute racks, configure switches, generate diskless images
Routers, logins & other service nodes¶
Todo
Flash services nodes, configure additional switches, kickstart nodes
Interconnect fabric configuration¶
Todo
OpenSM/BXI AFM installation and configuration