Installation methodology

Offsite, Onsite and production phases

The Offsite phase includes:

  • taking the machine-room layout of the cluster into account

  • integrating the Manufacturer data corresponding to the hardware to be installed and configured

  • setting up a platform and software solution in charge of recording and maintaining the above information:

    • a machine or VM hosting the Ocean software stack

    • innovative software packages to:

      • quickly integrate structural changes of the cluster

      • interface data from the various manufacturers into a unified, standardized YAML format

      • ease day-to-day operation and maintenance in operational condition

  • creating a DVD or USB medium that bootstraps the deployment of the first node of the cluster

  • configuring the network and the addressing plan

  • virtualization as it will be used in this software stack

  • configuring and organizing the storage and the associated software layers

The Onsite phase is the initial installation of all the components making up the cluster. It describes all the steps needed to set up the node subsystems and the network infrastructure required for the whole system to work properly. This phase also includes the integration of the platform described above.

The Production phase is the operation of what was installed during the Onsite phase. Together with the ‘Administration’, ‘HandBooks’ and ‘CookBooks’ sections, it allows the cluster to be kept in operational condition. It naturally includes the administration management tools, ‘Monitoring’ and auto-corrective automation in order to ensure high availability.

Training on the tools and on the functional arrangements is an essential part of the collaborative work between teams.

Offsite preparation

Manufacturer data

This step collects and cross-references the following pieces of information:

  • the physical location (machine-room topology), provided by the Customer

  • the physical composition of the cluster (provided by the Manufacturer):

    • the number of racks and their height

    • the subsystems making up each rack

    • the connection capabilities of each subsystem

    • the links between the connectors of the different subsystems

    • the MAC addresses of part of the equipment (top and worker), and DHCP option 82 support for the rest

All these elements are consolidated into a database. This database, initially external to the cluster, will be re-imported during the Onsite phase. It will be used throughout the life of the cluster, among other things for hardware maintenance operations. The Manufacturer data is provided as a ‘netlist’. Clarification of this netlist, or even a change in its layout, may be requested from the Manufacturer.

Platform and software for structural knowledge

Setting up the Ocean tools for managing the environment and its lifecycle requires a dedicated system.

Using a VM gives more flexibility, mobility and security (a single qcow file to back up).

VM SitePrep

Install and configure Pcocc for a siteprep VM

Todo

Retrieve the notes from the setup of the mtest VM

Installing RackTables, hwdb and netcc

  • RackTables:

    • Installation

      yum install -y RackTables mariadb-server mariadb
      
    • Apache integration

      mkdir /var/www/html/racktables
      ln -s /usr/share/RackTables/wwwroot/index.php /var/www/html/racktables
      
    • Start the services and follow the instructions displayed at %HOST%/racktables

      systemctl start httpd mariadb
      # Step 1: login and password are described in secret.php
      touch '/etc/RackTables/secret.php'; chmod a=rw '/etc/RackTables/secret.php'
      # Step 3
      mysql << EOF
        CREATE DATABASE racktables_db CHARACTER SET utf8 COLLATE utf8_general_ci;
        CREATE USER racktables_user@localhost IDENTIFIED BY 'MY_SECRET_PASSWORD';
        GRANT ALL PRIVILEGES ON racktables_db.* TO racktables_user@localhost;
      EOF
      # Step 4
      chmod 440 /etc/RackTables/secret.php
      chown apache:apache /etc/RackTables/secret.php
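
    • Optional sanity check that the database and the credentials work (the password below is the example one created above):

      mysql -u racktables_user -pMY_SECRET_PASSWORD racktables_db -e 'SHOW TABLES;'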
      
  • hwdb (shipped in the confiture package):

    • Row insertion:

      hwdb obj add -t Row A
      hwdb obj list

    • Racks by row:

      hwdb obj add -t Rack --container A A[3-7]
      hwdb obj list
    
    • Restore types, models, ports and compatibilities

      Todo

      Make the restore options consistent!

      hwdb port type restore rt_dumps/ptypes.dump
      hwdb port compat restore --csv rt_dumps/pcompat.dump
      hwdb obj model restore rt_dumps/models.dump
      
    • Insert the Sequana2 cells

      hwdb cell add --rack A6 --prefix s20 templates/sequana2.hw.yaml
      hwdb cell add --rack A7 --prefix s22 templates/sequana2.hw.yaml
      
    • Insert switches, disk arrays, top and worker nodes (example)

      # inserts 2 servers (2U) at base levels 15 and 17 in rack A4
      hwdb obj add -t server --container A4 --label top --slots 15,17 --size 2 top[1,2]
      hwdb obj update --model "SuperMicro 2U" x430 top[1-3]
      hwdb obj update --model "SuperMicro 2U" x430 worker[1-3]

      # inserts 2 x Nexus 9364C in rack A3 at base levels 6 and 9
      hwdb obj add -t 'network switch' --container A3 --label nexus-9364c \
                                       --slots 6,9 --size 2 esw[1-2]
      hwdb obj update --model "Nexus" 9364c esw[1-2]

      # inserts 4 x 3650 in rack A4 between levels 6 and 9
      hwdb obj add -t 'network switch' --container A4 --label 3650 \
                                       --slots 6,7,8,9 --size 1 esw[3-6]
      hwdb obj update --model "Cisco" 3650 esw[3-6]

      # inserts a JBOD (2U) in rack A6 at base level 3
      hwdb obj add -t DiskArray --container A6 --label jbod-r6 \
                                --slots 3 --size 2 --model "SuperMicro 2U" x430 yyy

      # inserts the cold door
      hwdb obj add -t PDU --container A3 --label cooldoor --slots 1 --size 1 --subcontainer rear i0r0cooldoor0
      
    • Insert links

      hwdb port add --label master -t hardwired 1000Base-T node180 Ethernet1
      hwdb port add --label slave -t hardwired 1000Base-T node180 Ethernet2
      # And links
      hwdb port link i10esw1 Ethernet4 node180 Ethernet1
      hwdb port link i10esw2 Ethernet4 node180 Ethernet2
      hwdb port update --label 'master opt82 shared=BMC nolag' node180 Ethernet1
      hwdb port update --label 'slave opt82 shared=BMCslave nolag' node180 Ethernet2
      #
      hwdb port compat add --from 1000Base-T --to 'empty SFP+'
      # Uplinks
      hwdb port add -t QSFP+ 'empty QSFP' esw[1-2] Ethernet[1-48]
      hwdb port link esw1 Ethernet[1-4] esw2 Ethernet[1-4]
      hwdb port link i10esw1 Ethernet53 esw1 Ethernet5
      hwdb port link i10esw2 Ethernet53 esw2 Ethernet5
      # lags and speed
      hwdb port update --label 'speed=40000 lag=vpc10' i10esw[1-2] Ethernet53
      hwdb port update --label 'speed=40000 lag=vpc10' esw[1-2] Ethernet5
    
  • A tool that reads the Manufacturer netlist and converts all its entries into hwdb commands (see the sketch after this list)

    Note

    A structured CSV file, with only the descriptions usable in a specific sheet, could be provided to reduce the disparities between the different manufacturers and allow a simplified production of hwdb commands.

    Todo

    Write a requirements specification for the needs related to netcc
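
As an illustration, here is a minimal sketch of such a conversion, assuming a hypothetical normalized CSV with columns name,type,rack,slot,size; the real netcc tool and the exact CSV layout are still to be specified (see the Todo above):

# Sketch only: turn each CSV entry into an 'hwdb obj add' command (adjust columns and options to the real format)
while IFS=, read -r name type rack slot size; do
    echo "hwdb obj add -t '${type}' --container ${rack} --slots ${slot} --size ${size} ${name}"
done < netlist.csv > insert_objects.sh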

Installing and configuring confiture

  • Installation

    yum install -y confiture git emacs-nox vim vim-enhanced
    
  • Bootstrap confiture

    git init cluster
    cp -aR /usr/share/doc/confiture*/examples/* cluster/
    
  • Configure the database URL in confiture.yaml; the paths are relative to the location of confiture.yaml:

    # Starting configuration (/path/confiture/confiture.yaml)
    common:
        hiera_conf: hiera.yaml
        template_dir: templates/
        output_dir: output/
    
    dhcp:
        conf_name: dhcpd.conf
    
    dns:
        conf_name: named.conf
    
    racktables:
        url: 'mysql://racktables_user:MY_SECRET_PASSWORD@localhost/racktables_db'
    
  • Confiture Network Range

    In the network.yaml file, define the subnets associated with each type of equipment:

    • 1 bbone network for backbone access of the top and worker nodes: A.B.C.0/24

    • 1 eq network for access to and monitoring of the equipment: E.4.0.0/23

    • 1 adm network for admin access to the management nodes: E.1.0.0/24

    • 1 data network for access to the data in GlusterFS: E.5.0.0/24

    • 1 ipmi network for BMC (IPMI) access to the management nodes: E.4.0.0/24

    Todo

    Verify the network definitions

    networks:
      # TOP Bbone network
      bbone:
        range: A.B.C.0/24
        interface: 'enp130s0f0'
        nameservers:
          - "${address('top1-bone')}"
        tftpservers:
          - "${address('top1-bone')}"
        bmgrservers:
          - "${address('top1-bone')}"
      # Vlan 1 ?
      eq:
        range: X.0.0.0/23
        interface: 'eno2'
        nameservers:
          - "${address('top1-eq')}"
        ntpservers:
          - "${address('top1-eq')}"
        tftpservers:
          - "${address('top1-eq')}"
        bmgrservers:
          - "${address('top1-eq')}"
      # Administration network
      # Vlan 1 ?
      adm:
        range: X.1.0.0/24
        interface: 'ens1'
        bmgrservers:
          - "${address('top1-adm')}"
        nameservers:
          - "${address('top1-adm')}"
        ntpservers:
          - "${address('top1-adm')}"
        tftpservers:
          - "${address('top1-adm')}"
      # Vlan 1 ?
      data:
        range: X.5.0.0/24
        interface: 'enp130s0f0'
        bmgrservers:
          - "${address('top1-adm')}"
        nameservers:
          - "${address('top1-adm')}"
        ntpservers:
          - "${address('top1-adm')}"
        tftpservers:
          - "${address('top1-adm')}"
      # BMC: physical network
      # Vlan 104 ?
      ipmi:
        range: X.4.0.0/24
        interface: 'enp130s0f0'
        bmgrservers:
          - "${address('top1-adm')}"
        nameservers:
          - "${address('top1-adm')}"
        ntpservers:
          - "${address('top1-adm')}"
        tftpservers:
          - "${address('top1-adm')}"
      [...]
    

    In the addresses.yaml file, associate IPs with the networks defined above. Example:

    addresses:
      top[1-3]:
        default: [adm,eq,bbone,data,ipmi]
        bbone:   A.B.C.[1-3]
        eq:      X.0.0.[1-3]
        adm:     X.1.0.[1-3]
        data:    X.5.0.[1-3]
        ipmi:    A.B.C.[128-130]
      worker[1-3]:
        default: [adm,eq,bbone,data,ipmi]
        bbone:   A.B.C.[4-6]
        eq:      X.0.0.[4-6]
        adm:     X.1.0.[4-6]
        data:    X.5.0.[4-6]
        ipmi:    A.4.0.[128-130]
      esw[1-2]:
        default: [adm]
        adm:     A.4.$(islet-id).[1-2]
      esw[4-6]:
        default: [adm]
        adm:     A.4.$(islet-id).[4-6]
    

Todo

To be completed with the switches and the islets?

Installation media preparation

On-site installation will require a traditional installation using external installation media. To guarantee that this medium's content matches what we intend to install, we will generate it ourselves. To do this, we must have a node (virtual or not) running the OS we want to install.

Here, we will use the latest cloud-ocean pcocc image available. Other means can be used to launch the very same image (VirtualBox, libvirt, …).

Boot image

First step is to generate a boot image using lorax. This image will include a minimal OS and the anaconda installer. No other content (RPMs for instance) is included.

Install lorax:

yum install -y lorax

Generation currently requires the CentOS-os, CentOS-updates, CentOS-extras and Ocean repositories. Collect the required repository URLs:

yum repolist -v | grep baseurl

Launch lorax generation:

# Ocean major.minor version (2.x)
oswanted=2.6
# URL Ocean repo
yumsrv="http://pkg/mirror/pub/linux/ocean/"
lorax --isfinal -p Ocean -v ${oswanted} -r 1 \
  -s ${yumsrv}/${oswanted}/ocean/x86_64 \
  -s ${yumsrv}/${oswanted}/centos-os/x86_64 \
  -s ${yumsrv}/${oswanted}/centos-updates/x86_64 \
  -s ${yumsrv}/${oswanted}/centos-extras/x86_64 \
  -s ${yumsrv}/${oswanted}/epel/x86_64 \
  -s ${yumsrv}/${oswanted}/greyzone/x86_64 \
  /tmp/lorax_image

Installation repos

Now we have to add some package content to the installation media. First, gather all the packages that might be required during the kickstart, using yum:

mkdir -p /tmp/ocean_media/Packages/
yum install -y --installroot=/tmp/ocean_media/Packages/ --downloadonly --downloaddir=/tmp/ocean_media/Packages/ @core @base @anaconda-tools anaconda puppet puppet4 bridge-utils lsof minicom strace tcpdump vim emacs-nox bind-utils crash yum-utils
rm -Rf /tmp/ocean_media/Packages/var

If any other package is required it should be included here.

Recreate the yum groups using CentOS's comps.xml:

createrepo -g /dev/shm/packages/ocean_centos/comps.xml /tmp/ocean_media/

Media metadata

Mount and copy the content of the generated boot image:

mkdir /mnt/lorax_image /tmp/lorax_image_content
mount -o loop /tmp/lorax_image/images/boot.iso /mnt/lorax_image
rsync -avr /mnt/lorax_image/ /tmp/lorax_image_content
rm /tmp/lorax_image_content/isolinux/boot.cat

And now that we have all the bits to make the media, assemble everything:

mkisofs -o /tmp/ocean.iso \
  -b isolinux/isolinux.bin -c isolinux/boot.cat \
  -boot-load-size 4 -boot-info-table -no-emul-boot \
  -eltorito-alt-boot -e images/efiboot.img -no-emul-boot \
  -R -V "Ocean ${oswanted} x86_64" -T -graft-points \
  isolinux=/tmp/lorax_image_content/isolinux \
  images/pxeboot=/tmp/lorax_image_content/images/pxeboot \
  LiveOS=/tmp/lorax_image_content/LiveOS \
  EFI/BOOT=/tmp/lorax_image_content/EFI/BOOT \
  images/efiboot.img=/tmp/lorax_image_content/images/efiboot.img \
  .discinfo=/tmp/lorax_image/.discinfo \
  .treeinfo=/tmp/lorax_image/.treeinfo \
  Packages=/tmp/ocean_media/Packages \
  repodata=/tmp/ocean_media/repodata
isohybrid --uefi /tmp/ocean.iso
implantisomd5 /tmp/ocean.iso
checkisomd5 /tmp/ocean.iso

Finally, try it out on a machine with qemu installed and X11 access:

qemu-system-x86_64 -m 1024 -smp 1 -cdrom ./ocean.iso

When validated, burn it to a DVD or write it to USB storage, as shown in the example below.
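
A minimal example of writing the hybrid ISO to a USB stick; /dev/sdX is a placeholder for your USB device, double-check it with lsblk first as all data on it will be destroyed:

dd if=/tmp/ocean.iso of=/dev/sdX bs=4M status=progress oflag=sync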

RAID Configuration on top and worker

Using a console plugged into each node, create:

  • 1 RAID1 volume named 'system' with the first two drives

  • 1 RAID10 volume named 'data' with all the other drives

Initialize all RAID drives.

Network definition

Todo

Guide on how to do RackTables insertion. VLAN & IP partitioning design. The result should be configuration files generated by confiture (DNS, DHCP, switches).

Vlan and IP design guide

In an Ocean Stack cluster, the first requirement is that each islet must be in an independent set of VLANs. This is a prerequisite for three reasons: scalability, reliability and ease of management.

This is because a cluster evolves: adding or removing nodes must not affect the operational state of the cluster.

Because of this, the Ethernet fabric design should be able to route between those VLANs in an effective way. The current best practice (the documented N9K L3 Fabric architecture) uses an L3 fabric and the BGP protocol to dynamically route IP traffic between islets.

A second requirement is a clear separation of IP subnets depending on the node or equipment type. For example, in a compute islet, compute nodes and their related BMCs should be in separate IP subnets. The same should be done for administrative nodes versus service IPs.

A best practice is a hierarchical allocation of IP subnets that respects CIDR aggregation. This makes the design of ACLs easier: for example, having all the administrative allocations in a first "/13" subnet and all the service nodes in a second "/13" subnet.

An example of IP allocation could be:

10.0.0.0/8
├── 10.0.0.0/13   Management nodes and services
│   ├── 10.1.0.0/24    Central mngt servers
│   ├── 10.1.10.0/24   Islet 10 mngt IPs
│   ├── 10.1.20.0/24   Islet 20 mngt IPs
│   └── 10.3.0.0/24    Service IPs
├── 10.8.0.0/13   Service nodes and related equipment
│   ├── 10.8.0.0/24    Central service nodes
│   └── 10.8.20.0/24   Islet 20 service nodes
├── 10.16.0.0/13  User nodes and related equipment
│   └── 10.16.20.0/24  Islet 20 compute nodes
└── 10.32.0.0/11  Cluster equipment
    └── 10.32.20.0/24  Islet 20 compute node BMCs

And with a VLAN mapping that ensures no equipment can spoof equipment of another type:

VLAN mapping

VLAN   IP Subnets
-----  -----------------------------
A      10.1.0.0/24, 10.3.0.0/24
B      10.8.0.0/24
C      10.1.10.0/24
D      10.1.20.0/24
E      10.8.20.0/24
F      10.16.20.0/24, 10.32.20.0/24

Virtual machine definition

Todo

Guide on how to do VM definition (Pcocc + Puppet) with ready-to-use examples for mandatory services.

Storage definition

Todo

Guide on how to design the GlusterFS cluster. May be limited to our way to use gluster (blocks of 3 servers)

Onsite Installation

Overview

The installation process is roughly the following:

  • Install the base system on the first management node

  • Configure this node with all the components needed to deploy the other management nodes

  • Deploy the other management nodes using the management network

  • Deploy the Ethernet fabric (administration network)

  • Install and configure the Ocean components on those nodes using the temporary infrastructure of the first node

  • Validate the final infrastructure

  • Redeploy and integrate the first node

Note that most configuration files will already have been generated using confiture, Ocean's configuration generator.

When these steps are all done, diskless or diskfull compute nodes can be deployed. Compute node hardware specifics are out of the scope of this document, but some advice may be given.

Requirements

Management nodes should be configured with their storage system ready to use. The names of those disks (as seen by the OS) will be required by BMGR for the kickstart process.

We advise a minimum of 60 GB of RAID1 storage for the management node system. Data storage will depend on your hardware, but hardware RAID controllers are preferred over software ones.

Note

The top management nodes of our test bed have 2 SATA-DOMs in RAID1 (Intel Rapid Storage) and 10 disks in RAID10 (+2 hot spares), seen by the OS as Volume0_0 and sdc respectively.

The default BIOS configuration will be just fine in most cases; we just need the following features to be activated (or deactivated):

  • SRIOV support activated

  • AES-NI support activated (not mandatory but advised)

  • Legacy boot only

  • BMC configured with DHCP (if they are cabled inside the cluster, at your discretion if not).

  • Energy saving features disabled (Fan profile, CPU profile, Energy efficient features, …)

  • Boot order: Network, CD/DVD, USB, system hard drives

  • Network boot devices (this setting might be handled by an option ROM):

    • Interface cabled onto the bbone (backbone) network for the top nodes

    • Interface cabled onto the management network for the other management nodes

    • Interface cabled onto the administration network

Moreover, network switches should be in factory configuration.

Note

To factory-reset a Cisco switch, run erase startup-config and then reload in the management shell.

Note

To factory-reset an Arista switch: in the Aboot shell (at boot time), run mv /mnt/flash/startup-config /mnt/flash/startup-config.old and reboot; or, in a privileged shell, run erase startup-config and then reload.

Warning

Some switches keep their ports disabled while the port is coming up (Spanning Tree related). Moreover, DHCP snooping may be enabled by default. To mitigate both issues, set the DHCP server port as a trusted source (ip dhcp snooping trust) and set server-facing ports as edge ports (Cisco's portfast: spanning-tree portfast).

This installation method also requires that Ocean’s repositories are reachable.

First node deployment

System installation

With the Ocean installation medium burned to a USB key or DVD, boot the first node. The graphical installer is not covered here, as the text installer is easier to document and to use in a console installation context.

If you have never done so, we advise checking the medium using the "Test this media & Install Ocean ${oswanted}" boot option. It may take some time but gives confidence.

When the installer has started and presents the main menu, proceed with the configuration:

  1. Language setting: English (United States)

  2. Timezone : Europe/Paris

  3. Installation source : Local media (auto detected)

  4. Software selection : Minimal Install

  5. Installation destination: Use the whole system disk with LVM. The partitioning scheme doesn’t really matter here as we’ll reinstall this node soon.

  6. KDump: Enabled

  7. Network configuration: Configure the backbone interface in order to get a remote access. Also configure nameservers and hostnames.

  8. Root password: Configure a temporary root password

  9. User creation: No system user should be created

System pre-configuration

If you have anything to do after the installation but before rebooting, you can modify the configuration from Anaconda's shell (switch to it with Alt+Tab). The system is installed under /mnt/sysimage.

For instance, here we disable firewalld and SELinux and change the default SSH port:

systemctl --root /mnt/sysimage disable firewalld
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /mnt/sysimage/etc/selinux/config
sed -i 's/^#Port 22/Port 422/' /mnt/sysimage/etc/ssh/sshd_config

After the installation is complete, make sure the node boots from the system disks and open a remote shell to it.

System configuration

Anaconda installations enable some unwanted features like SELinux and firewalld. Deactivate them:

systemctl disable --now firewalld
setenforce Permissive
sed -i 's/^SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config

For security purposes, we strongly recommend using port 422 for SSH. To do so:

sed -i 's/^#Port 22/Port 422/' /etc/ssh/sshd_config
systemctl restart sshd

The installation process requires the complete set of Ocean repositories; configure them manually on the first node. You should have the following repos configured:

  • Ocean

  • Ocean-updates

  • Greyzone

  • Greyzone-updates

  • CentOS

  • CentOS-updates

  • CentOS-extras

  • EPEL

  • Ocean-missing

  • Gluster

Note

An ocean.repo may be available at the root of your package repositories.
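
If your mirror provides one, it can be fetched directly; the URL below is only an example reusing the mirror root from the media preparation step, adapt it to your repository layout:

# Example only: adjust the URL to the root of your package repositories
curl -o /etc/yum.repos.d/ocean.repo http://pkg/mirror/pub/linux/ocean/ocean.repo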

You may have to disable the included repos that use the official CentOS mirrors to make yum work. Use the --disablerepo option to do so:

yum --disablerepo base,extras,updates makecache

Install some packages, permanently disable the official CentOS repos and synchronize the system with the available packages:

yum -y --disablerepo base,extras,updates install yum-utils yum-plugin-priorities
yum-config-manager --disable base,extras,updates
yum distribution-synchronization
systemctl disable --now NetworkManager
yum remove -y NetworkManager\*

Network configuration

This node is connected to all the available networks (backbone, management and administration). The backbone was configured by you in Anaconda's text UI. Now configure all the internal networks using the following template and the addressing scheme designed in the off-site step:

# /etc/sysconfig/network-scripts/ifcfg-eno2
# Here eno2 is the management network, and is 10.0.0.1
DEVICE=eno2
BOOTPROTO=static
BROADCAST=10.0.0.255
IPADDR=10.0.0.1
NETMASK=255.255.255.0
NETWORK=10.0.0.0
ONBOOT=yes

You will most probably require some IP routes to be configured; if so, don't forget to set them in /etc/sysconfig/network-scripts/route-INTERFACE_NAME. An example is sketched below.
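
For instance, static routes for the management interface could be declared as follows (interface name, networks and gateway are placeholders to adapt to your addressing plan):

cat > /etc/sysconfig/network-scripts/route-eno2 << EOF
10.3.0.0/24 via 10.0.0.254
10.5.0.0/24 via 10.0.0.254
EOF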

Mellanox cards

If you're using Mellanox VPI cards for 40G/50G/100G Ethernet links, install the Mellanox OFED packages and load the drivers:

yum install -y mlnx-ofa_kernel kmod-mlnx-ofa_kernel ocean-fw-mlnx-hca infiniband-diags mstflint kmod-kernel-mft-mlnx unzip
systemctl start openibd

If needed, use the firmware files present in /usr/share/ocean-fw-mlnx-hca/firmware/ and the mstflint tool to burn your firmware:

unzip %FIRMWARE%.bin.zip
mstflint -d 81:00.0 -i %FIRMWARE%.bin burn

Methods to get the card PSID and OPN can be found in /usr/share/ocean-fw-mlnx-hca/release_notes/README.txt.

If needed and using the mstconfig tool, verify and set the link type to Ethernet (a link type of 2 means Ethernet):

mstconfig -d 81:00.0 query | grep LINK_TYPE
mstconfig -y -d 81:00.0 set LINK_TYPE_P1=2

After configuring the Mellanox card for Ethernet, the FlexBoot mechanism is activated and may take a long time to initialize 40G links. To deactivate FlexBoot:

mstconfig -d 81:00.0 q LEGACY_BOOT_PROTOCOL EXP_ROM_PXE_ENABLE
mstconfig -y -d 81:00.0 set LEGACY_BOOT_PROTOCOL=NONE EXP_ROM_PXE_ENABLE=0

After a reboot, the card should appear as an ensX network device and can be configured like the other interfaces.

MAC addresses gathering

If this has not been done yet, here is a method to collect MAC addresses on the management network. We assume here that the BMCs auto-configure using DHCP.

Remember that some switches have some requirements (especially spanning-tree related) that have to be met. See Requirements for details.

Using SSH or a console cable, open a shell to the management network switch and display the ARP table.

Here we’re using a USB console cable on a Cisco Catalyst switch:

screen /dev/ttyACM1

Switch> show mac address-table

Using the displayed MAC/Port mapping, match them with the expected cabling (hwdb port list --local esw2), insert them into confiture’s data files and re-generate the DHCP configuration.

The shut/no shut trick may be applied on a switch port to force the equipment to relaunch the DHCP phase.

Note

Catalyst management interfaces don't do DHCP by default; to activate it, add ip address dhcp to the management interface configuration (fastethernet0 in our case). Get the interface's MAC with show interface fastethernet0

For the backbone network, you may not have access to the switch. As only 3 nodes will boot over it, a simple tcpdump while booting each node will do the job (see the example below).
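
For instance, on the first node (the backbone interface name below is the one used in our examples):

# Print the link-level headers of DHCP traffic to read the client MAC addresses
tcpdump -i enp130s0f0 -en port 67 or port 68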

DHCP & Named installation

Using the node used for the off-site preparation phase, update confiture’s data with the discovered MACs, re-generate the dhcp configuration and import the dhcpd and named configuration files.

Put them in the right place and start bind and dhcpd.

Note

Some adjustments may have to be made to the generated configuration. As a general rule, don't modify generated files: modify the templates and import the generated files again.

yum install -y dhcp bind bind-utils
systemctl enable --now named dhcpd

Now, configure resolv.conf with yourself as a nameserver and verify that all the BMCs are reachable, for example as sketched below.
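
A minimal sketch, assuming named answers locally and that BMC hostnames use an -ipmi suffix as in the off-site addressing plan (adapt the names to your own convention):

echo "nameserver 127.0.0.1" > /etc/resolv.conf
for bmc in top{1..3}-ipmi worker{1..3}-ipmi; do
    ping -c1 -W1 "$bmc" >/dev/null && echo "$bmc OK" || echo "$bmc UNREACHABLE"
done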

LAN MACs gathering

Now gather the management nodes' LAN interface MACs. To do so, either:

  • Make them boot on the network and collect the MACs:

    • Make sure that the interface is used by the BIOS for PXE (setting in BIOS menu or Option ROM)

    • Using IPMI set the next boot device to PXE:

      ipmitool -U %USER% -P %PASSWORD% -H %HOSTNAME% chassis power off
      ipmitool -U %USER% -P %PASSWORD% -H %HOSTNAME% chassis bootdev pxe
      ipmitool -U %USER% -P %PASSWORD% -H %HOSTNAME% chassis power on
      
    • Collect the MACs on the switch or use tcpdump to capture DHCP requests:

      $ screen /dev/ttyACM1
      
      > show mac address-table
      
  • Use the BMC web interface to get the system's LAN MAC address.

  • Use the BIOS or Option ROM information

  • On SuperMicro hardware, you can get the first LAN MAC by issuing the following IPMI raw command:

    ipmitool -U %USER% -P %PASSWORD% -H %HOSTNAME% raw 0x30 0x21 | tail -c 18 | tr ' ' ':'
    

With those MACs gathered, update confiture's data and reload the DHCP server with the freshly generated configuration.

BMGR installation

Install the BMGR tool:

yum install -y bmgr

Start and initialize the database:

systemctl enable --now mariadb
mysql << EOF
grant all privileges on bmgr.* to bmgr_user@'localhost' identified by 'bmgr_pass';
create database bmgr;
EOF
FLASK_APP=bmgr.app flask initdb

Add the WSGI entrypoint into Apache’s configuration file:

echo 'WSGIScriptAlias /bmgr "/var/www/bmgr/bmgr.wsgi"' >> /etc/httpd/conf/httpd.conf
systemctl enable --now httpd

Test with the CLI:

bmgr host list

Configuration

Create node profiles and assign weights to them:

bmgr profile add -w 0 ocean_mngt
bmgr profile add -w 5 ocean_mngt_top
bmgr profile add -w 10 ocean_mngt_top_1
bmgr profile add -w 5 ocean_mngt_worker
bmgr profile add -w 5 ocean_mngt_islet_worker

Add the cluster nodes and associated profiles into bmgr:

bmgr host add --profiles ocean_mngt,ocean_mngt_top,ocean_mngt_top_1 top[1-3]
bmgr host add --profiles ocean_mngt,ocean_mngt_worker worker[1-3]
bmgr host add --profiles ocean_mngt,ocean_mngt_islet_worker islet[10-11,20-21,...]

Add profile-specific attributes:

# The names of the network interface are given as configuration examples (see section 'Network Configuration')
bmgr profile update ocean_mngt_top_1 -a netdev enp130s0f0 -a ks_drive Volume0_0
bmgr profile update ocean_mngt_worker -a netdev enp3s0f0 -a ks_drive Volume0_0
bmgr profile update ocean_mngt_islet_worker -a netdev eno1 -a ks_drive Volume0_0
bmgr profile update ocean_mngt -a console ttyS1,115200 -a ks_selinux_mode disabled -a ks_firewall_mode disabled -a ks_rootpwd root -a kickstart http://top1-mngt/bmgr/api/v1.0/resources/kickstart/

Note

This strongly depends on your hardware specifics; it may be convenient to create additional profiles.

For example, Cisco Nexus 9K Zero-Touch Provisioning can use bmgr features to autoconfigure the switches. It is up to administrators to design the profile hierarchy and attributes. This is only an example used on our test bed.

Moreover, to help you, bmgr can assign weights to individual profiles, giving them a higher priority.

Deployment server

Lorax image

The kickstart process will use a custom boot image; this image is generated with the lorax tool.

Install lorax:

yum install -y lorax

Launch the build process, with the package repo URLs defined in the repo file:

lorax -p Ocean -v ${oswanted} -r 1 $(sed -ne 's/^baseurl=/-s /p' /etc/yum.repos.d/ocean.repo) /var/www/html/boot

Configure bmgr accordingly:

bmgr profile update ocean_mngt -a initrd http://top1-mngt/boot/images/pxeboot/initrd.img -a kernel http://top1-mngt/boot/images/pxeboot/vmlinuz -a install_tree http://top1-mngt/boot

Note

As top nodes may be deployed over a different physical network (backbone instead of the internal network), bmgr and other configuration items may have to be duplicated between profiles. For example, for top nodes:

bmgr profile update ocean_mngt_top -a initrd http://top1-bbone/boot/images/pxeboot/initrd.img -a kernel http://top1-bbone/boot/images/pxeboot/vmlinuz -a install_tree http://top1-bbone/boot

Repositories

The kickstart process requires local repos. Using reposync and createrepo, create a temporary clone of the CentOS and Ocean repositories:

yum install -y createrepo
reposync -p /var/www/html/boot/packages -r centos-updates -r centos-os -r ocean -r ocean-updates -r ocean-missing -n -m
createrepo -g /var/www/html/boot/packages/centos-os/comps.xml /var/www/html/boot

Warning

Repository names (-r arguments) may differ

Warning

This will use roughly 12 GB in the /var filesystem

Package repository proxy

Using Apache, configure a proxy to your package repository:

cat > /etc/httpd/conf.d/mirror.conf << EOF
ProxyPass /mirror http://yumsrv.ccc.cea.fr/
ProxyPassReverse /mirror http://yumsrv.ccc.cea.fr/
EOF
systemctl reload httpd

Warning

Adapt the content of mirror.conf to your repository URL. It should point to a URL where all the repos are available as subdirectories.

Configure bmgr accordingly:

echo ${oswanted}
bmgr profile update ocean_mngt -a ks_repos http://top1-mngt/mirror/ocean/${oswanted}/ocean/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/ocean-updates/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/centos-os/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/centos-update/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/centos-extras/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/epel/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/greyzone/x86_64,http://top1-mngt/mirror/ocean/${oswanted}/greyzone-updates/x86_64

Admin SSH key

Generate an SSH key; it will be used once the kickstart process is finished (as no root password will be set):

ssh-keygen -b 4096
cp ~/.ssh/id_rsa.pub /var/www/html/authorized_keys
cp ~/.ssh/id_rsa.pub /root/.ssh/authorized_keys

Configure bmgr accordingly:

bmgr profile update ocean_mngt -a ks_authorized_keys_url http://top1-mngt/authorized_keys

TFTP server

A TFTP server is required for PXE chainloading. Install a TFTP server:

yum install -y xinetd tftp-server tftp
systemctl enable --now xinetd tftp

And make iPXE network boot loader images available through TFTP:

yum install -y ipxe-bootimgs
ln /usr/share/ipxe/{undionly.kpxe,ipxe.efi} /var/lib/tftpboot/

Warning

Symbolic links are not followed by the TFTP server. Only use hard links or copy the files you want to serve.

DHCP update

Update the DHCP template and confiture’s data with deployment server specifics:

  • BMGR server URL

  • TFTP server IP

  • iPXE ROM name.

  • DNS IPs

Apply the configuration and restart the DHCP server. A sketch of the chainloading logic that typically ends up in the generated dhcpd.conf is shown below.
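
In this sketch the TFTP server IP and the bmgr iPXE resource URL are placeholders; the real values come from your confiture template and your bmgr setup:

next-server 10.0.0.1;
if exists user-class and option user-class = "iPXE" {
    # Clients already running iPXE chain to the boot script served by bmgr
    filename "http://top1-mngt/bmgr/api/v1.0/resources/ipxe_boot/";
} else {
    # Legacy BIOS clients load the iPXE ROM from the TFTP server
    filename "undionly.kpxe";
}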

Note

Some equipment may only support an EFI ROM; modify the template to reflect this.

Worker nodes deployment

Now that we have everything required to kickstart a node, try to deploy the second node:

  • Double check that iPXE script and kickstart file are correct:

    bmgr resource render ipxe_deploy_boot top2
    bmgr resource render kickstart top2
    

    Note

    The kickstart template may be modified; it is located in /etc/bmgr/templates/ks_rhel7.jinja

  • Configure the BIOS with the settings mentioned above. Make sure that the RAID devices are present and correctly referenced in the kickstart file.

  • Set the next bootdev to PXE:

    ipmitool -U %USER% -P %PASS% -H %BMC% chassis bootdev pxe
    
  • Enable deploy mode in bmgr:

    bmgr alias override -o ipxe_boot ipxe_deploy_boot top2
    
  • Start it and monitor the process with a remote console (either SOL or console redirection):

    ipmitool -U %USER% -P %PASS% -H %BMC% chassis power on
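
    For example, to follow the installation over Serial-over-LAN (type '~.' to leave the SOL session):

      ipmitool -U %USER% -P %PASS% -H %BMC% sol activate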
    

When the node is fully kickstarted, it will be in a state where:

  • A minimal set of packages is installed

  • Proxied repos are configured

  • The interface used for deployment is configured. The other ones are not.

  • A ssh daemon is running

  • Root’s authorized_keys is deployed (with the given URL)

If you use Mellanox cards as multi-gigabit Ethernet cards, you may have to flash and configure them the same way as on the first node; see Mellanox cards.

Make sure the storage you intend to use as a GlusterFS brick is available and ready to use. We strongly recommend setting a filesystem label on the gluster block device; use xfs_admin -L to set a label on an XFS filesystem, for example as shown below.
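
For example (label and device are placeholders; run this on an unmounted filesystem):

xfs_admin -L gluster-data /dev/sdX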

Ethernet fabric configuration

Switch configuration

The Ethernet fabric may be configured using 2 different methods:

  • Manual initial configuration and generated configuration deployment

  • Zero touch provisioning (ZTP for Arista, POAP for Cisco Nexus)

Zero touch provisioning is very specific to your hardware and may require third-party tools or servers. We will only document the manual process in this general-purpose installation guide.

Note

Cisco POAP is documented in this annex: Cisco PowerOn Auto Provisioning

This process requires a manual step for the initial switch configuration. Connect to each switch using a serial console and set up remote access. This usually includes:

  • IP address assignment on the management interface

  • Administrative user creation

  • Privileged shell (aka enable mode) password setup

  • Testing from a remote host

Using the configuration file generated with confiture, test the configuration bits on the real-world switch. If everything looks good, deploy it entirely using the already deployed TFTP or HTTP server.

Note

This step might be iterative: test on the switch, fix the confiture template, redeploy, and so on.

Node configuration

Todo

Check on the various clusters whether any information would be useful here

Management stack deployment

Puppet server installation

Now install puppet server and all required components on the first node:

yum install -y puppet4 puppetserver puppet-global puppet-extras puppet-addons git rubygem-r10k rubygem-hocon emacs-nox emacs-yaml-mode vim

Create puppet’s required git repos :

git clone --mirror /usr/share/puppet-global /var/lib/puppet-global
git init --bare /var/lib/puppet-cccenv
echo 'ref: refs/heads/production' > /var/lib/puppet-cccenv/HEAD
git init --bare /var/lib/puppet-domain
echo 'ref: refs/heads/production' > /var/lib/puppet-domain/HEAD

Clone them locally:

mkdir /root/puppet
cd /root/puppet
git clone /var/lib/puppet-global global
git clone /var/lib/puppet-cccenv cccenv
git clone /var/lib/puppet-domain domain

And bootstrap cccenv and domain repos:

cd /root/puppet/cccenv
mkdir -p modules/empty/manifests files hieradata
touch modules/empty/manifests/empty.pp
git add .
git commit -m 'Initial commit'
git branch -m master production
git push -u origin HEAD:production
cd /root/puppet/domain
mkdir -p files/$(facter domain)/{all-nodes,nodes,hieradata} hieradata
ln -sf ../files/$(facter domain)/hieradata hieradata/$(facter domain)
git add .
git commit -m 'Initial commit'
git branch -m master production
git push -u origin HEAD:production

Set the upstream origin in case of puppet-global update:

cd /root/puppet/global
git remote add upstream /usr/share/puppet-global

Set the committer's name and email for each repo:

git --git-dir /root/puppet/global/.git config --local user.name "Super Admin"
git --git-dir /root/puppet/global/.git config --local user.email "super.admin@ocean"
git --git-dir /root/puppet/cccenv/.git config --local user.name "Super Admin"
git --git-dir /root/puppet/cccenv/.git config --local user.email "super.admin@ocean"
git --git-dir /root/puppet/domain/.git config --local user.name "Super Admin"
git --git-dir /root/puppet/domain/.git config --local user.email "super.admin@ocean"

Configure r10k manually, insert the following in /etc/puppetlabs/r10k/r10k.yaml:

---
:cachedir: /var/cache/r10k
:sources:
  :global:
    remote: /var/lib/puppet-global
    basedir: /etc/puppetlabs/code/environments
:deploy:
  purge_whitelist: [ ".resource_types/*", ".resource_types/**/*" ]

Deploy the repos with r10k:

r10k deploy environment -pv

Configure the master's ENC in /etc/puppetlabs/puppet/puppet.conf:

[master]
  node_terminus = exec
  external_nodes = /sbin/puppet-external

Start the puppetserver:

systemctl enable --now puppetserver

Set the current node (the first node) profile in /etc/puppet/puppet-groups.yaml:

environments:
  production: 'top1'
roles:
  puppetserver: 'top1'

Test and then apply this profile:

puppet-check -v --server $(facter fqdn)
puppet-apply -v --server $(facter fqdn)

Note

This will manage all the files and components required to run a puppet server. The only unmanaged things are the 3 repos in /var/lib/.

Note

Some warnings about missing augeas lenses may appear in the puppet-check output; you can safely ignore them:

[...]
Augeas didn't load ... with Trapperkeep.lns
[...]

You now have a working puppet server.

Profile setup

Ocean includes a set of basic profiles that configure the management stack. Many of them require configuration. The available profiles are listed in the hieradata/global folder of the global repo.

Warning

The following configuration files are only examples; adapt them to the specifics of your deployment.

ClusterShell groups configuration

To have a convenient way to define node roles, define the ClusterShell group configuration this way:

sed -i -e 's/^default:.*/default: cluster/' /etc/clustershell/groups.conf

cat >/etc/clustershell/groups.d/cluster.yaml <<EOF
cluster:
  top: 'top[1-3]'
  worker: 'top[1-3],worker[1-3]'
  i_worker: 'islet[10-11,20-21]'
  puppetserver: 'top1'
  etcd: '@top,@worker,@i_worker'
  etcd_client: '@i_worker'
  fleet: '@top,@worker,@i_worker'
  gluster_server: '@top,@worker'
  gluster_client:  '@i_worker'
  pcocc_standalone: '@top,@worker,@i_worker'
  pcocc_standalone_top: '@top'
  mngt_top: '@top'
  mngt_common: '@top,@worker,@i_worker'
  all: '@top,@worker,@i_worker'
EOF
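
You can quickly check that the group source resolves as expected:

nodeset -f @all
nodeset -f @gluster_server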

Profile dispatch

In the puppet-groups.yaml file, dispatch the profiles onto the management nodes:

environments:
  production: '@all'
roles:
  puppetserver: '@puppetserver'
  etcd: '@etcd'
  etcd_client: '@etcd_client'
  fleet: '@fleet'
  gluster_server: '@gluster_server'
  gluster_client: '@gluster_client'
  pcocc_standalone: '@pcocc_standalone'
  pcocc_standalone_top: '@pcocc_standalone_top'
  90_mngt_top: '@mngt_top'
  91_mngt_common: '@mngt_common'
  99-common: '@all'

Common configuration and resources

Most profiles require a set of basic configuration and resources, such as the network configuration. To provide it, create common profiles in the domain repo and configure them using the following content:

# hieradata/91_mngt_common.yaml
resources:
 net::ifcfg:
   "%{hiera('adm_interface')}":
     mode: 'bridge'
     bridge: 'bradm'
     mtu: 9000
   bradm:
     mode: 'fromdns'
     type: 'Bridge'
     dnssuffix: '-adm'
     mask: '255.255.255.0'
     mtu: 9000
   "bradm:data":
     mode: 'fromdns'
     dnssuffix: '-data'
     mask: '255.255.255.0'
     bridge: 'brmngt'
   "%{hiera('mngt_interface')}":
     mode: 'bridge'
     bridge: 'brmngt'
   brmngt:
     mode: 'fromdns'
     type: 'Bridge'
     dnssuffix: '-mngt'
     mask: '255.255.255.0'
 #net::route:
 #  bradm:
 #    xtype: |
 #     content:10.3.0.0/24 via ADM_GATEWAY_IP
 #     10.5.0.0/24 via ADM_GATEWAY_IP'
# hieradata/90_mngt_top.yaml
resources:
 net::ifcfg:
   "%{hiera('bbone_interface')}":
     mode: 'bridge'
     bridge: 'brbone'
   brbone:
     mode: 'fromdns'
     type: 'Bridge'
     dnssuffix: '-bbone'
     mask: '255.255.255.0'
 #net::route:
 #  brbone:
 #    xtype: 'content:default via BBONE_GATEWAY_IP'

Warning

You will most probably require some routes to be set on the backbone interface; to do so, instantiate a net::route as shown in the comment of the 90_mngt_top.yaml file.

Note

Some node-specific variables must be defined there; create a node hiera data file (like hieradata/top1.yaml for top1) in the domain repo. The following example specifies the network interfaces and the fleet role for top1.

# hieradata/top1.yaml
adm_interface: 'ens1'
bbone_interface: 'enp130s0f0'
mngt_interface: 'eno2'
fleet_role: 'top'

Same kind of variables can be set using profile-specific variables.

To create a new profile, create a hieradata file in the domain or cccenv repo and assign it to nodes in /etc/puppet/puppet-groups.yaml.

Commit your change:

git add hieradata
git commit -m "Common configuration"

Time synchronization

Because the management daemons require a small clock skew, the management nodes have to be time-synchronized.

Configure the chrony daemon on the first node to synchronize with an upstream NTP server:

server <upstream ntp server> iburst

# Record the rate at which the system clock gains/loses time.
driftfile /var/lib/chrony/drift

# Enable kernel RTC synchronization.
rtcsync

# In first three updates step the system clock instead of slew
# if the adjustment is larger than 10 seconds.
makestep 10 3

# Allow NTP client access from local network.
allow <your network>

logdir /var/log/chrony
acquisitionport 123

Do the first sync with ntpdate:

clush -bw top[2-3],worker[1-3] yum install -y ntpdate chrony
clush -bw top[2-3],worker[1-3] ntpdate top1-mngt.$(facter domain)

Configure chrony on nodes:

cat >/tmp/chrony.conf <<EOF
server top1-mngt.$(facter domain) iburst
driftfile /var/lib/chrony/drift
rtcsync
makestep 10 3
logdir /var/log/chrony
EOF
# Push the configuration to the nodes and start chronyd
clush -bw top[2-3],worker[1-3] --copy /tmp/chrony.conf --dest /etc/chrony.conf
clush -bw top[2-3],worker[1-3] systemctl enable --now chronyd

Etcd configuration

In the hieradata folder of the domain repo, configure the etcd profile accordingly:

# hieradata/etcd.yaml
etcd::initial_cluster:
  - 'top1=https://top1.%{::domain}:2380'
  - 'top2=https://top2.%{::domain}:2380'
  - 'top3=https://top3.%{::domain}:2380'
  - 'worker1=https://worker1.%{::domain}:2380'
  - 'worker2=https://worker2.%{::domain}:2380'
  - 'worker3=https://worker3.%{::domain}:2380'

Commit your change:

git add hieradata/etcd.yaml
git commit -m "Initial etcd configuration"

Fleet configuration

Same for fleet:

# hieradata/fleet.yaml
fleet::server::settings:
  etcd_servers: "[\"https://top1.%{::domain}:2379\", \"https://top2.%{::domain}:2379\", \"https://top3.%{::domain}:2379\", \"https://worker1.%{::domain}:2379\", \"https://worker2.%{::domain}:2379\", \"https://worker3.%{::domain}:2379\",]"
  etcd_username: 'root'
  etcd_password: 'password'
  public_ip: "%{::ipaddress_bradm}"
  metadata: "'hostname=%{::hostname},role=%{hiera('fleet_role')}'"
  enable_grpc: "true"

Commit your change:

git add hieradata
git commit -m "Initial fleet configuration"

Gluster configuration

The Gluster profile requires some configuration and already-mounted bricks. In the hieradata folder of the domain repo, configure the gluster_server profile and its requirements:

# hieradata/gluster_server.yaml
resources:
  gluster::mount:
    '/volspoms1':
      volume: 'top1-data,top2-data,top3-data,worker1-data,worker2-data,worker3-data:/volspoms1'
    '/volspoms2':
      volume: 'top1-data,top2-data,top3-data,worker1-data,worker2-data,worker3-data:/volspoms2'
  file:
    '/gluster':
      ensure: 'directory'
      owner: 'root'
      group: 'root'
      mode: '0755'
      tag: 'gluster'
    '/gluster/brick1':
      ensure: 'directory'
      owner: 'root'
      group: 'root'
      mode: '0755'
      tag: 'gluster'
    '/gluster/brick2':
      ensure: 'directory'
      owner: 'root'
      group: 'root'
      mode: '0755'
      tag: 'gluster'
    '/gluster/brick3':
      ensure: 'directory'
      owner: 'root'
      group: 'root'
      mode: '0755'
      tag: 'gluster'
    '/gluster/brick4':
      ensure: 'directory'
      owner: 'root'
      group: 'root'
      mode: '0755'
      tag: 'gluster'
    '/volspoms1':
      ensure: 'directory'
      owner: 'root'
      group: 'root'
      mode: '0755'
      tag: 'gluster'
    '/volspoms2':
      ensure: 'directory'
      owner: 'root'
      group: 'root'
      mode: '0755'
      tag: 'gluster'
  mount:
    '/gluster/brick1':
      ensure: 'mounted'
      fstype: 'xfs'
      device: "/dev/mapper/gluster-brick1"
      options: 'defaults,noatime,auto'
      dump: '1'
      pass: '2'
      tag: 'gluster'
      require: 'File[/gluster/brick1]'
    '/gluster/brick2':
      ensure: 'mounted'
      fstype: 'xfs'
      device: "/dev/mapper/gluster-brick2"
      options: 'defaults,noatime,auto'
      dump: '1'
      pass: '2'
      tag: 'gluster'
      require: 'File[/gluster/brick2]'
    '/gluster/brick3':
      ensure: 'mounted'
      fstype: 'xfs'
      device: "/dev/mapper/gluster-brick3"
      options: 'defaults,noatime,auto'
      dump: '1'
      pass: '2'
      tag: 'gluster'
      require: 'File[/gluster/brick3]'
    '/gluster/brick4':
      ensure: 'mounted'
      fstype: 'xfs'
      device: "/dev/mapper/gluster-brick4"
      options: 'defaults,noatime,auto'
      dump: '1'
      pass: '2'
      tag: 'gluster'
      require: 'File[/gluster/brick4]'
  gluster::peer:
    "top1-data.%{::domain}":
      fqdn: "top1.%{::domain}"
      pool: 'production'
      require: 'Class[Gluster::Service]'
    "top2-data.%{::domain}":
      fqdn: "top2.%{::domain}"
      pool: 'production'
      require: 'Class[Gluster::Service]'
    "top3-data.%{::domain}":
      fqdn: "top3.%{::domain}"
      pool: 'production'
      require: 'Class[Gluster::Service]'
    "worker1-data.%{::domain}":
      fqdn: "worker1.%{::domain}"
      pool: 'production'
      require: 'Class[Gluster::Service]'
    "worker2-data.%{::domain}":
      fqdn: "worker2.%{::domain}"
      pool: 'production'
      require: 'Class[Gluster::Service]'
    "worker3-data.%{::domain}":
      fqdn: "worker3.%{::domain}"
      pool: 'production'
      require: 'Class[Gluster::Service]'
  gluster::volume:
    'volspoms1':
      replica: 3
      arbiter: 1
      options:
        - 'features.shard: true'
        - 'features.shard-block-size: 64MB'
        - 'nfs.disable: true'
        # Virt group
        - 'performance.quick-read: false'
        - 'performance.read-ahead: false'
        - 'performance.io-cache: false'
        - 'performance.low-prio-threads: 32'
        - 'network.remote-dio: enable'
        - 'cluster.eager-lock: enable'
        - 'cluster.quorum-type: auto'
        - 'cluster.server-quorum-type: server'
        - 'cluster.data-self-heal-algorithm: full'
        - 'cluster.locking-scheme: granular'
        - 'cluster.shd-max-threads: 8'
        - 'cluster.shd-wait-qlength: 10000'
        - 'user.cifs: false'
      require:
        - "Gluster::Peer[top1-data.%{::domain}]"
        - "Gluster::Peer[top2-data.%{::domain}]"
        - "Gluster::Peer[top3-data.%{::domain}]"
        - "Gluster::Peer[worker1-data.%{::domain}]"
        - "Gluster::Peer[worker2-data.%{::domain}]"
        - "Gluster::Peer[worker3-data.%{::domain}]"
        - 'Mount[/gluster/brick1]'
        - 'Mount[/gluster/brick2]'
        - 'Mount[/gluster/brick3]'
      bricks:
        # 1st node - 2nd node - 3nd node
        #   Data   -   Data   - Arbiter
        - "top1-data.%{::domain}:/gluster/brick1/data"
        - "top2-data.%{::domain}:/gluster/brick1/data"
        - "top3-data.%{::domain}:/gluster/brick1/data"
        # Data - Arbiter - Data
        - "top3-data.%{::domain}:/gluster/brick2/data"
        - "top1-data.%{::domain}:/gluster/brick2/data"
        - "top2-data.%{::domain}:/gluster/brick2/data"
        # Arbiter - Data - Data
        - "top2-data.%{::domain}:/gluster/brick3/data"
        - "top3-data.%{::domain}:/gluster/brick3/data"
        - "top1-data.%{::domain}:/gluster/brick3/data"
        # Data - Data - Arbiter
        - "worker1-data.%{::domain}:/gluster/brick1/data"
        - "worker2-data.%{::domain}:/gluster/brick1/data"
        - "worker3-data.%{::domain}:/gluster/brick1/data"
        # Data - Arbiter - Data
        - "worker3-data.%{::domain}:/gluster/brick2/data"
        - "worker1-data.%{::domain}:/gluster/brick2/data"
        - "worker2-data.%{::domain}:/gluster/brick2/data"
        # Arbiter - Data - Data
        - "worker2-data.%{::domain}:/gluster/brick3/data"
        - "worker3-data.%{::domain}:/gluster/brick3/data"
        - "worker1-data.%{::domain}:/gluster/brick3/data"
    'volspoms2':
      replica: 3
      options:
        - 'nfs.disable: true'
      require:
        - "Gluster::Peer[top1-data.%{::domain}]"
        - "Gluster::Peer[top2-data.%{::domain}]"
        - "Gluster::Peer[top3-data.%{::domain}]"
        - "Gluster::Peer[worker1-data.%{::domain}]"
        - "Gluster::Peer[worker2-data.%{::domain}]"
        - "Gluster::Peer[worker3-data.%{::domain}]"
        - 'Mount[/gluster/brick4]'
      bricks:
        - "top1-data.%{::domain}:/gluster/brick4/data"
        - "top2-data.%{::domain}:/gluster/brick4/data"
        - "top3-data.%{::domain}:/gluster/brick4/data"
        - "worker1-data.%{::domain}:/gluster/brick4/data"
        - "worker2-data.%{::domain}:/gluster/brick4/data"
        - "worker3-data.%{::domain}:/gluster/brick4/data"

You need to prepare the bricks on each node:

pvcreate --dataalignment 768k /dev/sdb
vgcreate --physicalextentsize 768K gluster /dev/sdb
lvcreate --thin gluster/thin_pool --extents 100%FREE --chunksize 256k --poolmetadatasize 16G --zero n
lvcreate --thin --name brick1 --virtualsize 1.25t gluster/thin_pool
lvcreate --thin --name brick2 --virtualsize 1.25t gluster/thin_pool
lvcreate --thin --name brick3 --virtualsize 1.25t gluster/thin_pool
lvcreate --thin --name brick4 --virtualsize 2.5t gluster/thin_pool
mkfs.xfs -f -d su=256k,sw=1 -s size=512 /dev/mapper/gluster-brick1
mkfs.xfs -f -d su=256k,sw=1 -s size=512 /dev/mapper/gluster-brick2
mkfs.xfs -f -d su=256k,sw=1 -s size=512 /dev/mapper/gluster-brick3
mkfs.xfs -f -d su=256k,sw=1 -s size=512 /dev/mapper/gluster-brick4

Note

Here we assumed 6 × 960 GB drives configured in RAID 10. Have a look at the original Red Hat procedure here: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html/administration_guide/brick_configuration

Also add gluster_client profile configuration:

# hieradata/gluster_client.yaml
resources:
  file:
    '/volspoms1':
      ensure: 'directory'
      owner: 'root'
      group: 'root'
      mode: '0755'
      tag: 'gluster'
    '/volspoms2':
      ensure: 'directory'
      owner: 'root'
      group: 'root'
      mode: '0755'
      tag: 'gluster'
  gluster::mount:
    '/volspoms1':
      volume: 'top1-data,top2-data,top3-data,worker1-data,worker2-data,worker3-data:/volspoms1'
    '/volspoms2':
      volume: 'top1-data,top2-data,top3-data,worker1-data,worker2-data,worker3-data:/volspoms2'

Commit your change:

git add hieradata
git commit -m "Initial gluster configuration"

Pcocc configuration

The Pcocc profile requires some configuration. In the hieradata folder of the domain repo, configure the pcocc_standalone and pcocc_standalone_top profiles and their requirements:

# hieradata/pcocc_standalone.yaml
pcocc::config::batch::etcd_servers:
  - "top1.%{::domain}"
  - "top2.%{::domain}"
  - "top3.%{::domain}"
  - "worker1.%{::domain}"
  - "worker2.%{::domain}"
  - "worker3.%{::domain}"

pcocc::config_etcd_pwd_xtype: 'content:password'

pcocc::config::repos::repos:
  - name: volspoms
    path: /volspoms1/pcocc

pcocc::config::networks:
  adm:
    type: bridged
    host_bridge: "bradm"
    tap_prefix: "admtap"
    mtu: 9000
  mngt:
    type: bridged
    host_bridge: "brmngt"
    tap_prefix: "mntap"
    mtu: 1500

pcocc::config::templates:
  generic:
    resource_set: 'default'
    user_data: '/etc/pcocc/cloudinit/generic.yaml'
    image: 'volspoms:cloud-ocean2.6'
# hieradata/pcocc_standalone_top.yaml
pcocc::config::networks:
  bone:
    type: bridged
    host_bridge: "brbone"
    tap_prefix: "bbtap"
    mtu: 1500

Commit your change:

git add hieradata
git commit -m "Initial Pcocc configuration"

Profile application

Push and deploy all your changes to the puppetserver:

git --git-dir /root/puppet/global/.git push
git --git-dir /root/puppet/domain/.git push
git --git-dir /root/puppet/cccenv/.git push
r10k deploy environment -pv

Puppet agents bootstrap

Install puppet4 and puppet-addons on the management nodes:

clush -bw top[2-3],worker[1-3] yum install -y puppet4 puppet-addons

Bootstrap SSL certificates with a puppet-check:

clush -bw top[2-3],worker[1-3] puppet-check --tags net --server top1.$(facter domain)
clush -bw top[2-3],worker[1-3] -R exec puppet cert sign %h.$(facter domain)

Network configuration

Warning

Interface names will most probably not match the ones you have. Please make sure you reset the right interfaces. Here, interfaces are named as follows:

Host      Backbone     Management   Administration
--------  -----------  -----------  --------------
top1      enp130s0f0   eno2         ens1
top2      enp130s0f0   eno2         ens1
top3      enp3s0f0     enp3s0f1     ens6f0
worker1   -            enp3s0f0     ens6f0
worker2   -            enp3s0f0     ens6f0
worker3   -            enp3s0f0     ens6f0

Network configuration (and bridges) requires a bit more work to apply. Here, we'll apply the configuration files and then restart the interfaces.

First, apply network configuration files:

clush -bw top[1-3],worker[1-3] puppet-apply --tags net --server top1.$(facter domain)

Double-check that all nodes have their network configuration files correctly set:

clush -bw top[1-3],worker[1-3] 'more /etc/sysconfig/network-scripts/ifcfg-* | cat'

This only changes files; no restart is done. Make sure you have connectivity on all networks:

clush -bw top[1-3],worker[1-3] uname
clush -bw top[1-3]-bbone uname
clush -bw worker[1-3]-mngt uname

Stop the management interfaces:

clush -bw top[1-2] ifdown eno2
clush -bw top3 ifdown enp3s0f1
clush -bw worker[1-3] ifdown enp3s0f0

Start the management bridge:

clush -bw top[1-3],worker[1-3] ifup brmngt

Start the management interfaces:

clush -bw top[1-2] ifup eno2
clush -bw top3 ifup enp3s0f1
clush -bw worker[1-3] ifup enp3s0f0

And now make sure that the management nodes are still reachable through all networks:

clush -bw top[1-3],worker[1-3] uname
clush -bw top[1-3]-bbone,worker[1-3]-mngt uname

Using the same technique, restart administration network to make sure all interfaces are bridged correctly:

# Stop administration interfaces
clush -bw top[1-2] ifdown ens1
clush -bw top3,worker[1-3] ifdown ens6f0
# Start administration bridges
clush -bw top[1-3],worker[1-3]-mngt ifup bradm
# Start administration interfaces
clush -bw top[1-2] ifup ens1
clush -bw top3,worker[1-3]-mngt ifup ens6f0

Make sure once more that everything went fine:

clush -bw top[1-3],worker[1-3] uname
clush -bw top[1-3],worker[1-3]-mngt uname

Using the same technique, restart the backbone network to make sure all interfaces are bridged correctly:

# Stop backbone interfaces
clush -bw top2 ifdown enp130s0f0
clush -bw top3 ifdown enp3s0f0
# Start backbone bridges
clush -bw top[2-3] ifup brbone
# Start backbone interfaces
clush -bw top2 ifup enp130s0f0
clush -bw top3 ifup enp3s0f0

Using top2 as a relay, do the same thing on the first node (make sure you do not disconnect yourself!):

ifdown enp130s0f0; ifup brbone; ifup enp130s0f0
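
Finally, you may check that each bridge has its physical interface correctly enslaved. This is a quick sanity check, assuming bridge-utils is installed (on recent systems, ip link show master brbone gives the same information):

clush -bw top[1-3],worker[1-3] brctl show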

Gluster cluster bootstrap

After making sure that all nodes can reach each other using the -data suffix, apply the gluster_server profile:

clush -bw top[1-3],worker[1-3] puppet-apply-changes --tags gluster --server top1.$(facter domain)

This should have installed the gluster packages and created and mounted the /gluster filesystem that will contain all gluster bricks.
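
A quick way to verify this step (a sketch; the package name and mount point may differ slightly with your gluster release):

clush -bw top[1-3],worker[1-3] 'rpm -q glusterfs-server && df -h /gluster'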

Then apply it a second time on a single node; this step sets up the gluster peers and volumes:

puppet-apply-changes --tags gluster --server top1.$(facter domain)

Finally, launch it a third time to mount the volumes on all nodes:

clush -bw top[1-3],worker[1-3] puppet-apply-changes --tags gluster --server top1.$(facter domain)

Note

In case of failure, launch the puppet run with puppet-apply -v instead of puppet-apply-changes and investigate configuration errors.

Now, Gluster should be up and running:

clush -bw top[1-3],worker[1-3] df -ht fuse.glusterfs
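
You can also check the volumes from the server side:

gluster volume info
gluster volume status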

Apply quota to the 2 volumes:

gluster volume quota volspoms1 enable
# For a 5.5TB quota (bash needs integers, so 11/2)
gluster volume quota volspoms1 limit-usage / $((1024*1024*1024*1024*11/2)) 95
gluster volume quota volspoms2 enable
# For a 1TB quota
gluster volume quota volspoms2 limit-usage / $((1024*1024*1024*1024))

For the quotas to work correctly, you need at least one file (even an empty one) on each gluster filesystem.
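
For instance, assuming the volumes are mounted on /volspoms1 and /volspoms2 (the file names below are arbitrary):

touch /volspoms1/.quota_init /volspoms2/.quota_init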

Now the quotas should be OK:

gluster volume quota volspoms1 list
gluster volume quota volspoms2 list

Etcd cluster bootstrap

Apply the profile on all nodes:

clush -bw top[1-3],worker[1-3] puppet-apply-changes --tags etcd --server top1.$(facter domain)

Check that the cluster bootstrapped correctly:

etcdctl -C https://$(facter fqdn):2379 cluster-health

Activate authentication and remove default roles:

# etcdctl -C https://$(facter fqdn):2379 user add root
New password:
User root created
# etcdctl -C https://$(facter fqdn):2379 auth enable
Authentication Enabled
# etcdctl -C https://$(facter fqdn):2379 -u root role remove guest
Password:
Role guest removed

Check that user root is the only user left:

etcdctl -C https://$(facter fqdn):2379 -u root user list
Password:
root

Check that everything is still correct with:

etcdctl -C https://$(facter fqdn):2379 -u root cluster-health

Fleet cluster bootstrap

Apply the profile on all nodes:

clush -bw top[1-3],worker[1-3] puppet-apply-changes --tags fleet --server top1.$(facter domain)

Check that the cluster bootstrapped correctly:

fleetctl list-machines

Pcocc configuration

Note

PCOCC documentation and tutorials can be found here: https://pcocc.readthedocs.io/en/latest/index.html

Apply the profile on all nodes:

clush -bw top[1-3],worker[1-3] puppet-apply-changes --tags pcocc --server top1.$(facter domain)

Check that pcocc is configured correctly:

clush -bw top[1-3],worker[1-3] pcocc internal setup init
clush -bw top[1-3],worker[1-3] pcocc internal setup cleanup

Service VM installation

Most (if not all) of Ocean's services are hosted within service VMs. Each VM hosts a set of roles, which are configured using Puppet profiles. As an example, we consider the following VMs:

madmin

The central administration node; no real service is hosted on it.

VM bootstrap

Adding a new VM to an Ocean cluster is a three-step process:

  • Allocate and configure a DHCP-provided IP

  • Bootstrap and configure the VM

  • Launch it into the highly-available launch system

VM allocation

Launch and open a shell into the site-prep virtual machine. Inside the confiture configuration directory, allocate new IPs in hiera/addresses.yaml and MACs in hiera/hwaddrs.yaml.

Tip

MAC allocations may use the locally-administered MAC range. We advise using a pattern such as 52:54:XX:00:00:YY, where XX identifies the network and YY the VM, i.e. 52:54:00:00:00:00 for the first VM on one network and 52:54:01:00:00:00 for the first VM on the other network.

clush may ease this step:

clush -w vm[0-20] -R exec 'printf "52:54:01:00:00:%%02x\n" %n'

Generate the DNS and DHCP configuration:

confiture dhcp
confiture dns

Deploy the generated files on the first node.
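
As an illustration only (the exact file names produced by confiture and the target directories depend on your site layout, so treat the paths and service names below as placeholders):

scp dhcpd.conf root@top1:/etc/dhcp/dhcpd.conf
scp *.zone root@top1:/var/named/
ssh root@top1 'systemctl restart dhcpd named'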

VM Configuration

Insert the VM definition into the domain's hieradata using the following template:

# .ssh/authorized_keys content
vm_authorized_keys:
  - "ssh-rsa %% PUB KEY CONTENT %%"
vm_yum_repos:
  - 'name': 'centos-os'
    'url': 'http://ocean'
    'priority': '60'
  - 'name': 'centos-update'
    'url': 'http://ocean'
    'priority': '50'
resources:
  pcocc::standalone::vm:
    VM_NAME:
      reference_image_name: 'volspoms:cloud-ocean2.6'
      fleet: true
      resource_set: 'service'
      cpu_count: 4
      mem_per_cpu: 1000
      ethernet_nics:
        adm: 'ALLOCATED_MAC_ADM'
        bbone: 'ALLOCATED_MAC_BBONE'
      ssh_authorized_keys: "%{alias('vm_authorized_keys')}"
      yum_repos: "%{alias('vm_yum_repos')}"
      persistent_drive_dir: '/volspoms1/pcocc/persistent_drives'
      persistent_drives:
        - '/volspoms1/pcocc/persistent_drives/VM_NAME.qcow2'

Generate the VM's Puppet SSL certificate on the puppet master (the first node, so far) and add it to the domain repo:

#default VM_NAME="admin1 admin2 batch1 batch2 db1 i0con1 i0conf2 i0log1 infra1 infra2 lb1 lb2 monitor1 ns1 ns2 ns3 nsrelay1 webrelay1 webrelay2"
puppet cert generate $VM_NAME.$(facter domain)
mkdir -p /root/puppet/domain/nodes/$VM_NAME/etc/puppetlabs/puppet/ssl/{certs,private_keys}
cp /etc/puppetlabs/puppet/ssl/certs/$VM_NAME.pem /root/puppet/domain/nodes/$VM_NAME/etc/puppetlabs/puppet/ssl/certs/
cp /etc/puppetlabs/puppet/ssl/private_keys/$VM_NAME.pem /root/puppet/domain/nodes/$VM_NAME/etc/puppetlabs/puppet/ssl/private_keys/

Generate SSH host keys and add them to the domain repo:

mkdir -p /root/puppet/domain/nodes/${VM_NAME}/etc/ssh
ssh-keygen -t dsa -P "" -f /root/puppet/domain/nodes/${VM_NAME}/etc/ssh/ssh_host_dsa_key
ssh-keygen -t rsa -P "" -f /root/puppet/domain/nodes/${VM_NAME}/etc/ssh/ssh_host_rsa_key
ssh-keygen -t ecdsa -P "" -f /root/puppet/domain/nodes/${VM_NAME}/etc/ssh/ssh_host_ecdsa_key
ssh-keygen -t ed25519 -P "" -f /root/puppet/domain/nodes/${VM_NAME}/etc/ssh/ssh_host_ed25519_key
#
cd /root/puppet/domain/nodes/
git add -A .

Assign roles to the newly created VM in the ENC configuration file:

# /etc/puppet/puppet-groups.yaml
environments:
  production: 'VM_NAME'
roles:
  90-common: 'VM_NAME'

Commit and apply the configuration:

git commit
git --git-dir /root/puppet/domain/.git push
r10k deploy environment -pv
puppet-check -v --server $(facter fqdn)
puppet-apply -v --server $(facter fqdn)

To launch the VM manually, link fleet's unit file into the systemd search path and start the service:

systemctl link /etc/fleet/units/pcocc-vm-VM_NAME.service
systemctl start pcocc-vm-VM_NAME.service
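
Since the unit is also known to fleet, it should appear cluster-wide once started:

fleetctl list-units | grep pcocc-vm-VM_NAME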

You can follow the bootstrap process using the pcocc console command:

pcocc console -J VM_NAME vm0
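
Once the VM has finished booting, it should be reachable over SSH with the key declared in vm_authorized_keys (assuming the DNS entries generated earlier have been deployed):

ssh root@VM_NAME.$(facter domain) uname -a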

First node reinstallation

Todo

  • Stack validation

  • First node reinstallation/reintegration

Etcd reintegration

It may happen that gluster or etcd loses its cluster membership, leaving node top1 on its own.

In that case etcd does not start, and the etcd log contains a line such as member cd99d5c27dc7998b has already been bootstrapped.
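
Assuming etcd runs under systemd, the message can be spotted with:

journalctl -u etcd --no-pager | grep -i 'already been bootstrapped'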

One way to recover from this situation is to remove the old top1 member from the cluster by doing the following:

  • Get the ID of the member to be removed by listing the members of the cluster.

  • Remove the etcd data dir on the reinstalled top1 machine.

  • In /etc/etcd/etcd.conf, change the value of ETCD_INITIAL_CLUSTER_STATE from new to existing (still on the top1 machine).

  • Add the new member to the cluster using the same functional endpoint.

  • Start the etcd service on top1.

You will need the etcd root password (which should be stored in Puppet) when running some of the commands below. All these commands are to be executed on the top1 machine:

# etcdctl --endpoint https://top2.mg1.cloud.domain.fr:2379 member list
# etcdctl --endpoint https://top2.mg1.cloud.domain.fr:2379 -u root member remove cd99d5c27dc7998b
# mv /var/lib/etcd/top1.etcd /root/
# sed -i -e 's/ETCD_INITIAL_CLUSTER_STATE=\"new\"/ETCD_INITIAL_CLUSTER_STATE=\"existing\"/' /etc/etcd/etcd.conf
# etcdctl --endpoint https://top2.mg1.cloud.domain.fr:2379 -u root member add top1 https://top1.mg1.cloud.domain.fr:2380
# systemctl start etcd

One may check that the cluster is healthy:

# etcdctl --endpoint https://top2.mg1.cloud.domain.fr:2379 cluster-health
member 3910b3c7ca5407b is healthy: got healthy result from https://top1.mg1.cloud.domain.fr:2379
member 5c4493d902893be8 is healthy: got healthy result from https://top3.mg1.cloud.domain.fr:2379
member 647b167693a963df is healthy: got healthy result from https://top2.mg1.cloud.domain.fr:2379
cluster is healthy

Gluster reintegration

While the bricks should not have been impacted by the reinstallation of top1, recovering the gluster cluster consists in retrieving top1's old gluster UUID from the cluster and re-adding it to the new top1 machine. First, find the old UUID of top1 by running the following command on one of the remaining gluster nodes:

# gluster peer status
Number of Peers: 2

Hostname: top3-data.mg1.cloud.domain.fr
Uuid: 9acf71b8-263b-4056-8a2a-d4051911487c
State: Peer in Cluster (Connected)
Other names:
top3-data

Hostname: top1-data.mg1.cloud.domain.fr
Uuid: 5dd0b3f7-ab7e-4d9a-98f2-05395eec7891
State: Peer Rejected (Connected)

Now, on top1, stop the glusterd daemon and edit /var/lib/glusterd/glusterd.info to replace the UUID with the old one:

# systemctl stop glusterd
# sed -i -e 's/UUID=.*/UUID=5dd0b3f7-ab7e-4d9a-98f2-05395eec7891/' /var/lib/glusterd/glusterd.info
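
Check that the file now carries the old UUID:

# grep UUID /var/lib/glusterd/glusterd.info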

Retrieve the peer information files from the other nodes and delete the one that corresponds to top1:

# mkdir /root/peers
# scp top2:/var/lib/glusterd/peers/* /root/peers/
# scp top3:/var/lib/glusterd/peers/* /root/peers/
# rm /root/peers/5dd0b3f7-ab7e-4d9a-98f2-05395eec7891

Then copy these peer information files into /var/lib/glusterd/peers/:

# cp /root/peers/* /var/lib/glusterd/peers/

Restart gluster daemon:

# systemctl start glusterd

Then start healing the volumes:

# gluster volume heal volspoms1 full
# gluster volume heal volspoms2 full

To see the self-heal status of the volumes, you may execute the following commands:

# gluster volume heal volspoms1 info
# gluster volume heal volspoms2 info

When “Number of entries:” is 0 on every brick, everything is OK.
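
A convenient way to follow the healing progress, assuming the watch utility is available:

# watch -n 60 'gluster volume heal volspoms1 info; gluster volume heal volspoms2 info'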

Reference: Replacing a Host Machine with the Same Hostname (https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/sect-replacing_hosts#Replacing_a_Host_Machine_with_the_Same_Hostname).

Todo

  • Propose an installation order for the VMs?

  • Do not forget to remove top1 as the NTP reference server if it was configured as such

Todo

Reintegrate what follows into a more general section about Puppet.

Main classes affecting the various VMs

admin1 admin2 batch1 batch2 db1 i0con1 i0conf2 i0log1 infra1 infra2 lb1 lb2 monitor1 ns1 ns2 ns3 nsrelay1 webrelay1 webrelay2

ns:

dns_client
dns_server
gluster_client
ldap_server
log_client
log_client_islet0
mail_client
monitored_server
ntp_client

nsrelay:

dns_client
gluster_client
ldap_fuse
log_client
mail_client
monitored_server
ntp_client

webrelay:

dns_client
gluster_client
log_client
log_client_islet0
mail_client
monitored_server
ntp_client
webrelay

ilog:

conman_server
conman_server_islet0
dns_client
gluster_client
log_client
log_client_islet0
mail_client
monitored_server
ntp_client

infra:

dhcp_server
dns_client
gluster_client
log_client
log_client_islet0
mail_client
monitored_server
ntp_client
tftp_server

db:

auks_server
clary_server
dns_client
gluster_client
ldap_client
log_client
log_client_islet0
mail_client
monitored_server
ntp_client
slurm_db

lb:

dns_client
dns_server
gluster_client
haproxy_server
haproxy_server_http
haproxy_server_ldap
haproxy_server_puppet
log_client
log_client_islet0
mail_client
monitored_server
ntp_client

i0conf:

bmgr
bmgr_server
dns_cache
dns_client
dns_server
gluster_client
log_client
log_client_islet0
mail_client
monitored_server
ntp_client
puppetserver
puppetserver_islet0
racktables
webserver
webserver_islet0

Compute node installation

Todo

Flash compute racks, configure switches, generate diskless images

Routers, logins & other service nodes

Todo

Flash services nodes, configure additional switches, kickstart nodes

Interconnect fabric configuration

Todo

OpenSM/BXI AFM installation and configuration