Nexus 9000 Dynamic routing migration
====================================

This handbook explains the migration process from the legacy IP routing
architecture (as initially designed) to the *better* dynamic routing
architecture. We'll call the two Nexus 9336 the **top switches**, and the
Nexus 93180 pairs the **islet switches**.

Source & target design
----------------------

As initially discussed, the IP routing architecture was designed so that
there is a single routed hop between any two nodes in the Ethernet fabric.
This is seen as a performance optimization, but it introduces so much
configuration overhead that it gets in the way of most configuration changes
and makes debugging unintuitive.

Currently, **top switches** are present in all VLANs, while **islet
switches** carry a set of islet-specific VLANs plus VLAN 1, which is shared
with the **top switches**.

.. nwdiag::

   nwdiag {
     group { isletA }
     group { isletB }
     group { eswA }
     group { eswB }
     group { eswT }

     network top_vlans {
       address = "10.1.0.0/23"
       label = "Management VLANs"
       top
       isletA
       isletB
       eswT
       eswA
       eswB
     }
     network islet10_vlan {
       address = "10.X.10.0/24"
       label = "Islet 10 VLANs"
       eswA
       eswT
     }
     network islet20_vlan {
       address = "10.X.20.0/24"
       label = "Islet 20 VLANs"
       eswB
       eswT
     }
   }

Using this design, IP routing is asymmetric:

* Communications initiated from islets to tops are routed by islet switches
* Communications initiated from tops to islets are routed by top switches

The target design uses a new VLAN as a transit VLAN, and locally managed
subnets are announced on this VLAN using a dynamic routing protocol like BGP.

.. nwdiag::

   nwdiag {
     group { isletA }
     group { isletB }
     group { eswA eswB }
     group { eswT }

     network top_vlans {
       address = "10.1.0.0/23"
       label = "Management VLANs"
       top
       eswT
     }
     network ethernet_fabric {
       address = "169.254.0.0/16"
       label = "Transit VLAN"
       eswT
       eswA
       eswB
     }
     network islet10_vlan {
       address = "10.X.10.0/24"
       label = "Islet 10 VLANs"
       isletA
       eswA
     }
     network islet20_vlan {
       address = "10.X.20.0/24"
       label = "Islet 20 VLANs"
       isletB
       eswB
     }
   }

This is sub-optimal in terms of routing hops, but the configuration is much
simpler.

The legacy IP routing design has been taken into account in the IP allocation
scheme: because VLAN 1 is shared between top management nodes and islet
nodes, they sit in the same subnet. As such, one of the requirements is the
reallocation of all subnets that are shared between **top switches** VLANs
and **islet switches** VLANs.
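To visualise the extra routing hop mentioned above, here is a schematic
``traceroute`` from an islet node to a top management node, before and after
the migration. The addresses and annotations are purely illustrative, not
captured output:

.. code-block:: console

   # Legacy design: the islet switch SVI is the only routed hop
   $ traceroute -n 10.1.0.128
    1  10.X.10.254   (islet switch gateway)
    2  10.1.0.128    (destination)

   # Target design: the transit VLAN adds a hop through a top switch
   $ traceroute -n 10.1.0.128
    1  10.X.10.254   (islet switch gateway)
    2  169.254.0.1   (top switch, transit VLAN)
    3  10.1.0.128    (destination)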
Migration process
-----------------

Transit VLAN
^^^^^^^^^^^^

Allocate and deploy the transit VLAN and the associated IP allocation scheme.
This can be done with the following change in **confiture**'s configuration.

.. code-block:: diff

   commit 67da4312a4a97d698c33e252f8a34034670980ac
   Author: John Doe
   Date:   Tue Oct 6 16:57:45 2020 +0200

       Add VLAN 3 (Transit)

   diff --git a/hiera/addresses.yaml b/hiera/addresses.yaml
   index dcdcc6d..51c33dc 100644
   --- a/hiera/addresses.yaml
   +++ b/hiera/addresses.yaml
   @@ -944,6 +944,7 @@ addresses:
        eq: 10.0.0.[128-163] #/24
        adm: 10.1.0.[128-163] #/24
        eqs: 10.32.0.[128-163] #/24
   +    transit: 169.254.0.[1-36]
        critical: True

    ## Infiband top switches (manageable ones)

   diff --git a/hiera/switches.yaml b/hiera/switches.yaml
   index 044d891..aed1de6 100644
   --- a/hiera/switches.yaml
   +++ b/hiera/switches.yaml
   @@ -21,6 +21,9 @@ switches:
          2:
            addresses:
              - "${address('gw-common-equipment-core1')}/24"
   +      3:
   +        addresses:
   +          - "${address('esw1-transit')}/16"

Once this is deployed on the **top switches** and **islet switches**, they
can communicate on this VLAN.

.. code-block:: console

   # clush -bw esw[1,2,12-13,38-58/4,39-59/4] "ping 169.254.0.1"
   --------------- esw1 ---------------
   PING 169.254.0.1 (169.254.0.1): 56 data bytes
   64 bytes from 169.254.0.1: icmp_seq=0 ttl=255 time=0.357 ms
   64 bytes from 169.254.0.1: icmp_seq=1 ttl=255 time=0.298 ms
   64 bytes from 169.254.0.1: icmp_seq=2 ttl=255 time=0.249 ms
   64 bytes from 169.254.0.1: icmp_seq=3 ttl=255 time=0.403 ms
   64 bytes from 169.254.0.1: icmp_seq=4 ttl=255 time=0.306 ms

   --- 169.254.0.1 ping statistics ---
   5 packets transmitted, 5 packets received, 0.00% packet loss
   round-trip min/avg/max = 0.249/0.322/0.403 ms
   [...]

.. warning::

   If you're using ACLs, don't forget to add the corresponding rule. In this
   case::

      object-group ip address transit_subnet
        10 169.254.0.0/16
      ip access-list ROUTING_ACL
        400 permit ip addrgroup transit_subnet addrgroup transit_subnet vlan 3

BGP Sessions
^^^^^^^^^^^^

Create and start BGP sessions on **top switches** and **islet switches**.
The resulting configuration items on Nexus 9000 switches are the following.

On **top switches**::

   feature bgp

   route-map ROUTE_MAP permit 10
     match as-number 4294967294

   router bgp 4294967294
     address-family ipv4 unicast
       redistribute direct route-map ROUTE_MAP
       maximum-paths ibgp 10

   router bgp 4294967294
     neighbor 169.254.0.0/16
       update-source Vlan3
       remote-as 4294967294
       address-family ipv4 unicast
         route-reflector-client

On **islet switches**::

   feature bgp

   route-map ROUTE_MAP permit 10
     match as-number 4294967294

   router bgp 4294967294
     address-family ipv4 unicast
       redistribute direct route-map ROUTE_MAP
       maximum-paths ibgp 10

   router bgp 4294967294
     neighbor 169.254.0.1
       update-source Vlan3
       remote-as 4294967294
       address-family ipv4 unicast
         next-hop-self
     neighbor 169.254.0.2
       update-source Vlan3
       remote-as 4294967294
       address-family ipv4 unicast
         next-hop-self

This configuration creates iBGP peerings between **islet switches** and
**top switches**, with the **top switches** configured to redistribute
learned routes to the other clients (iBGP route reflection). Using this
configuration, any subnet directly connected to any of the switches is
reachable from the others.
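Assuming the peerings above are configured, session and route state can be
checked with the standard NX-OS show commands (output omitted here):

.. code-block:: console

   # All islet switch neighbors should be in "Established" state
   esw1# show bgp ipv4 unicast summary

   # Subnets redistributed by the other switches should show up as BGP routes
   esw1# show ip route bgp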
IP Scheme
^^^^^^^^^

Add and allocate the new IP scheme: islet management nodes (and switches) get
their own subnets (service, data and management subnets). For example, the
**islet** management node IP allocation has been updated as follows:

.. code-block:: diff

      adm: 10.1.0.[10-13,20-39,42-43,46-47,50-51,54-55,58-59] #/23
      data: 10.5.0.[10-13,20-39,42-43,46-47,50-51,54-55,58-59] #/24
      ipmi: 10.0.1.[10-13,20-39,42-43,46-47,50-51,54-55,60-61] #/23
   +  admbis: 10.1.[10-13/2,20-36/2,38-58/4].[9-10] #/24
   +  databis: 10.5.[10-13/2,20-36/2,38-58/4].[1-2] #/24
      critical: True

Make sure that all services hosted on islet nodes have an IP in these
subnets.

On **islet switches** (only), create a new VLAN for those subnets:

.. code-block:: yaml

   esw12:
     vpc:
       domain: "12"
       peer_addr: "${address('esw13-eq')}"
       peer: "esw13"
       source: "${address('esw12-eq')}"
     ports:
       mgmt:
         0:
           addresses:
             - "${address('esw12-eq')}/23"
       Vlan:
         1:
           addresses:
             - "${address('esw12-adm')}/23"
         3:
           addresses:
             - "${address('esw12-transit')}/16"
         4:
           addresses:
             - "${address('esw12-admbis')}/24"
             - "${address('gw-adm-i12main')}/24"
             - "${address('gw-data-i12main')}/24"
             - "${address('gw-svc-i12main')}/24"

Using the generated configurations, deploy this IP scheme on the **islet
switches**. This will create VLAN 4 and assign IPs to the related virtual
interfaces. If this is correctly applied, those IPs should be reachable from
the **top switches** through BGP announcements.

VLAN Interfaces
^^^^^^^^^^^^^^^

SVI Interfaces
""""""""""""""

First of all, allow islet workers to access the newly created VLAN:

.. code-block:: diff

   diff --git a/hiera/vlans.yaml b/hiera/vlans.yaml
   index 1116d01..36f2fea 100644
   --- a/hiera/vlans.yaml
   +++ b/hiera/vlans.yaml
   @@ -7,7 +7,7 @@ vlans:
        trunk: esw[1,2,10-13,20-39,42-43,46-47,50-51,54-55,58-59]
      4:
   -    trunk: esw[10-13,20-39,42-43,46-47,50-51,54-55,58-59]
   +    trunk: esw[10-13,20-39,42-43,46-47,50-51,54-55,58-59],islet[10-13,20-39,42-43,46-47,50-51,54-55,58-59]

This generates this kind of configuration modification:

.. code-block:: diff

   + switchport trunk allowed vlan 1,4
   + switchport
   + switchport mode trunk
   + switchport trunk native vlan 1

Applying this on switches must be done with care: moving a port from
``access`` mode to ``trunk`` mode will most likely trigger a port reset. If
this cannot be avoided, do it sequentially.

VLAN-aware bridges
""""""""""""""""""

Next, create a VLAN virtual interface on all islet hypervisors. This
interface will use the already existing bridge (``bradm``) and be addressed
on the new IP scheme.

.. code-block:: shell

   # /etc/sysconfig/network-scripts/ifcfg-bradm.4
   # Generated by Puppet
   DEVICE=bradm.4
   BOOTPROTO=static
   ONBOOT=yes
   IPADDR=10.1.XX.YY
   NETMASK=255.255.255.0
   MTU=9000
   VLAN=yes

Do not forget aliased interfaces (the ``data`` network for example), if any:

.. code-block:: shell

   # /etc/sysconfig/network-scripts/ifcfg-bradm.4:data
   # Generated by Puppet
   DEVICE=bradm.4:data
   BOOTPROTO=static
   ONBOOT=yes
   IPADDR=10.5.XX.YY
   NETMASK=255.255.255.0
   MTU=9000
   VLAN=yes

Add some ``udev`` rules to make tagged traffic work on this bridge. They
declare that both the physical interface (``enp59s0f1`` in this case) and
the bridge's self interface can handle VLAN-4 tagged traffic, which makes
the ``bradm.4`` interface work.

.. code-block:: shell

   ACTION=="add", SUBSYSTEM=="net", KERNEL=="bradm.4", RUN+="/usr/sbin/bridge vlan add dev bradm vid 4 self"
   ACTION=="add", SUBSYSTEM=="net", KERNEL=="bradm.4", RUN+="/usr/sbin/bridge vlan add dev enp59s0f1 vid 4"
   ACTION=="remove", SUBSYSTEM=="net", KERNEL=="bradm.4", RUN+="/usr/sbin/bridge vlan del dev bradm vid 4 self"
   ACTION=="remove", SUBSYSTEM=="net", KERNEL=="bradm.4", RUN+="/usr/sbin/bridge vlan del dev enp59s0f1 vid 4"
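Once the rules have fired (or after running the same ``bridge`` commands by
hand), VLAN membership can be verified; VID 4 should be listed on both the
physical interface and the bridge's self interface:

.. code-block:: shell

   # Check that VID 4 is allowed on the uplink and on the bridge itself
   bridge -d vlan show dev enp59s0f1
   bridge -d vlan show dev bradm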
Some policy-based routes are also required, to make sure that traffic coming
out of those IPs goes onto the right interface.

.. code-block:: shell

   ::::::::::::::
   /etc/sysconfig/network-scripts/route-bradm.4
   ::::::::::::::
   10.0.0.0/8 table 100 nexthop via 10.1.12.254 nexthop via 10.1.12.253
   10.1.12.0/24 dev bradm.4 src 10.1.XX.YY table 100

   ::::::::::::::
   /etc/sysconfig/network-scripts/route-bradm.4:data
   ::::::::::::::
   10.5.0.0/24 table 100 nexthop via 10.5.12.254 nexthop via 10.5.12.253
   10.5.12.0/24 dev bradm.4:data src 10.5.XX.YY table 100

   ::::::::::::::
   /etc/sysconfig/network-scripts/rule-bradm.4
   ::::::::::::::
   from 10.1.0.0/23 iif lo lookup main pref 1000
   from 10.3.0.0/24 iif lo lookup main pref 1000
   from 10.5.0.0/24 iif lo lookup main pref 1000
   from 10.1.0.0/16 iif lo lookup 100 pref 1100
   from 10.3.0.0/16 iif lo lookup 100 pref 1100
   from 10.5.0.0/16 iif lo lookup 100 pref 1100

Do not forget to execute ``ifup bradm.4`` after applying this configuration.

VLAN-aware VM interfaces
""""""""""""""""""""""""

A similar configuration has to be done on VM interfaces. Pcocc can handle
this during the VM start procedure if it is correctly defined:

.. code-block:: yaml

   [...]
   adm:
     type: bridged
     settings:
       host-bridge: bradm
       tap-prefix: admtap
       mtu: 9000
       vlans:
         - vid: 1
           type: native
         - vid: 4
           type: tagged
   [...]

Running VMs can be *live-patched* with ``bridge`` commands and some ``xargs``
magic:

.. code-block:: shell

   ls -d /sys/class/net/bradm/brif/*tap* | xargs -n 1 basename \
     | xargs -i -n 1 bridge vlan add dev {} vid 4

The IP configuration of VM interfaces is very similar to the hypervisors'.
Configure the interfaces:

.. code-block:: shell

   # /etc/sysconfig/network-scripts/ifcfg-eth0.4
   # Generated by Puppet
   DEVICE=eth0.4
   BOOTPROTO=static
   ONBOOT=yes
   IPADDR=10.1.42.3
   NETMASK=255.255.255.0
   MTU=9000
   VLAN=yes

And, for the same reasons, policy-based routing:

.. code-block:: shell

   # /etc/sysconfig/network-scripts/rule-eth0.4
   from 10.1.0.0/23 iif lo lookup main pref 1000
   from 10.3.0.0/24 iif lo lookup main pref 1000
   from 10.5.0.0/24 iif lo lookup main pref 1000
   from 10.1.0.0/16 iif lo lookup 100 pref 1100
   from 10.3.0.0/16 iif lo lookup 100 pref 1100
   from 10.5.0.0/16 iif lo lookup 100 pref 1100

   # /etc/sysconfig/network-scripts/route-eth0.4
   default table 100 nexthop via 10.1.42.254 nexthop via 10.1.42.253
   10.1.42.0/24 dev eth0.4 src 10.1.42.3 table 100

VMs with VRRP-managed virtual IPs should be reconfigured too. This is not too
difficult because ``keepalived`` can set multiple IPs with a single VRRP
instance::

   vrrp_instance puppetserver_islet_42_1_215 {
       [...]
       track_interface {
           eth0
           eth0.4
       }
       virtual_ipaddress {
           10.3.0.18/24 dev eth0
           10.3.42.1/24 dev eth0.4
       }
       [...]
   }
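As a final sanity check on a VM (a sketch, reusing the illustrative
``10.1.42.3`` address from above), verify that the tagged interface, the
policy rules and the dedicated routing table are all in place:

.. code-block:: shell

   # The tagged interface should carry its address
   ip -br addr show dev eth0.4

   # The policy rules and table 100 should be populated
   ip rule show
   ip route show table 100

   # Traffic sourced from the new subnet should leave through eth0.4
   ip route get 10.1.0.1 from 10.1.42.3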