Nexus 9000 Dynamic routing migration

This handbook explains the migration process from the legacy IP routing architecture (as initially designed) to the improved dynamic routing architecture.

We’ll call the two Nexus 9336 the top switches, and the Nexus 93180 pairs the islet switches.

Source & target design

As initially discussed, the IP routing architecture is designed so that there is a single hop between any two nodes in the Ethernet fabric. This is seen as a performance optimization.

But this design introduces so much configuration overhead that it gets in the way of most configuration changes and makes debugging unintuitive.

Currently, top switches are present in all VLANs, while islet switches carry a set of islet-specific VLANs plus VLAN 1, which they share with the top switches.

With this design, IP routing is asymmetric:

  • Communications initiated from islets to tops are routed by islet switches

  • Communications initiated from tops to islets are routed by top switches

The target design uses a new VLAN as a transit VLAN: locally managed subnets are announced over this VLAN using a dynamic routing protocol (here, BGP).

This is sub-optimal in terms of routing hops, but the configuration is much simpler.

The legacy IP routing design has been taken into account in the IP allocation scheme: because VLAN 1 is shared between top management nodes and islet nodes, they are in the same subnet.

As such, one of the requirements is the reallocation of all subnets that are shared between top switch VLANs and islet switch VLANs.

Migration process

Transit VLAN

Allocate and deploy the transit VLAN and the associated IP allocation scheme. This can be done with the following change in confiture’s configuration.

commit 67da4312a4a97d698c33e252f8a34034670980ac
Author: John Doe <john.doe@noreply.fr>
Date:   Tue Oct 6 16:57:45 2020 +0200

    Add VLAN 3 (Transit)

diff --git a/hiera/addresses.yaml b/hiera/addresses.yaml
index dcdcc6d..51c33dc 100644
--- a/hiera/addresses.yaml
+++ b/hiera/addresses.yaml
@@ -944,6 +944,7 @@ addresses:
       eq: 10.0.0.[128-163] #/24
       adm: 10.1.0.[128-163] #/24
       eqs: 10.32.0.[128-163] #/24
+      transit: 169.254.0.[1-36]
       critical: True

     ## Infiband top switches (manageable ones)
diff --git a/hiera/switches.yaml b/hiera/switches.yaml
index 044d891..aed1de6 100644
--- a/hiera/switches.yaml
+++ b/hiera/switches.yaml
@@ -21,6 +21,9 @@ switches:
             2:
               addresses:
                 - "${address('gw-common-equipment-core1')}/24"
+            3:
+              addresses:
+                - "${address('esw1-transit')}/16"

Once this is deployed on the top switches and islet switches, they can communicate over this VLAN:

# clush -bw esw[1,2,12-13,38-58/4,39-59/4] "ping 169.254.0.1"
---------------
esw1
---------------
PING 169.254.0.1 (169.254.0.1): 56 data bytes
64 bytes from 169.254.0.1: icmp_seq=0 ttl=255 time=0.357 ms
64 bytes from 169.254.0.1: icmp_seq=1 ttl=255 time=0.298 ms
64 bytes from 169.254.0.1: icmp_seq=2 ttl=255 time=0.249 ms
64 bytes from 169.254.0.1: icmp_seq=3 ttl=255 time=0.403 ms
64 bytes from 169.254.0.1: icmp_seq=4 ttl=255 time=0.306 ms

--- 169.254.0.1 ping statistics ---
5 packets transmitted, 5 packets received, 0.00% packet loss
round-trip min/avg/max = 0.249/0.322/0.403 ms
[...]

Warning

If you’re using ACLs, don’t forget to add the corresponding rule. In this case:

object-group ip address transit_subnet
  10 169.254.0.0/16
ip access-list ROUTING_ACL
  400 permit ip addrgroup transit_subnet addrgroup transit_subnet vlan 3
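
To check that the object group and the rule are actually deployed, the clush pattern shown above can be reused (the node set below is just an example):

# clush -bw esw[1,2,12-13] "show object-group transit_subnet"
# clush -bw esw[1,2,12-13] "show ip access-lists ROUTING_ACL"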

BGP Sessions

Create and start BGP sessions on top switches and islet switches. The resulting configuration on Nexus 9000 switches is the following.

On top switches:

feature bgp
route-map ROUTE_MAP permit 10
  match as-number 4294967294
router bgp 4294967294
  address-family ipv4 unicast
    redistribute direct route-map ROUTE_MAP
    maximum-paths ibgp 10
router bgp 4294967294
  neighbor 169.254.0.0/16
    update-source Vlan3
    remote-as 4294967294
    address-family ipv4 unicast
      route-reflector-client

On islet switches:

feature bgp
route-map ROUTE_MAP permit 10
  match as-number 4294967294
router bgp 4294967294
  address-family ipv4 unicast
    redistribute direct route-map ROUTE_MAP
    maximum-paths ibgp 10
router bgp 4294967294
  neighbor 169.254.0.1
    update-source Vlan3
    remote-as 4294967294
    address-family ipv4 unicast
      next-hop-self
  neighbor 169.254.0.2
    update-source Vlan3
    remote-as 4294967294
    address-family ipv4 unicast
      next-hop-self

This configuration creates iBGP peerings between the islet switches and the top switches, and the top switches are configured to redistribute learned routes to the other clients (iBGP route reflection).

With this configuration, any subnet directly connected to any of the switches is reachable from the others.
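
Session state and learned prefixes can be checked on the switches, for example (reusing the clush pattern from above, node sets are illustrative):

# clush -bw esw[1,2] "show ip bgp summary"
# clush -bw esw[12-13] "show ip bgp neighbors 169.254.0.1"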

IP Scheme

Add and allocate the new IP scheme: islet management nodes (and switches) have their own subnets (service, data and management subnets).

For example, the islet management node IP allocation has been updated as follows:

adm:     10.1.0.[10-13,20-39,42-43,46-47,50-51,54-55,58-59] #/23
data:    10.5.0.[10-13,20-39,42-43,46-47,50-51,54-55,58-59] #/24
ipmi:    10.0.1.[10-13,20-39,42-43,46-47,50-51,54-55,60-61] #/23
admbis:  10.1.[10-13/2,20-36/2,38-58/4].[9-10] #/24
databis: 10.5.[10-13/2,20-36/2,38-58/4].[1-2] #/24
critical: True

Make sure that all services hosted on islet nodes have IPs in these subnets.

On islet switches (only), create a new VLAN for those subnets:

esw12:
  vpc:
     domain: "12"
     peer_addr: "${address('esw13-eq')}"
     peer: "esw13"
     source: "${address('esw12-eq')}"
  ports:
     mgmt:
       0:
         addresses:
           - "${address('esw12-eq')}/23"
     Vlan:
       1:
         addresses:
           - "${address('esw12-adm')}/23"
       3:
         addresses:
           - "${address('esw12-transit')}/16"
       4:
         addresses:
           - "${address('esw12-admbis')}/24"
           - "${address('gw-adm-i12main')}/24"
           - "${address('gw-data-i12main')}/24"
           - "${address('gw-svc-i12main')}/24"

Using the generated configurations, deploy this IP scheme on the islet switches. This will create VLAN 4 and assign IPs to the related virtual interfaces.

If this is correctly applied, those IPs should be reachable from the top switches through BGP announcements.
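
A quick sanity check is to confirm that VLAN 4 exists on the islet switches and that the top switches learned the new subnets through BGP (node sets are illustrative):

# clush -bw esw[12-13] "show vlan id 4"
# clush -bw esw[1,2] "show ip route bgp"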

VLAN Interfaces

SVI Interfaces

First of all, allow islet workers to access the newly created VLAN:

diff --git a/hiera/vlans.yaml b/hiera/vlans.yaml
index 1116d01..36f2fea 100644
--- a/hiera/vlans.yaml
+++ b/hiera/vlans.yaml
@@ -7,7 +7,7 @@ vlans:
       trunk: esw[1,2,10-13,20-39,42-43,46-47,50-51,54-55,58-59]

     4:
-      trunk: esw[10-13,20-39,42-43,46-47,50-51,54-55,58-59]
+      trunk: esw[10-13,20-39,42-43,46-47,50-51,54-55,58-59],islet[10-13,20-39,42-43,46-47,50-51,54-55,58-59]

This generates the following kind of configuration change:

+ switchport trunk allowed vlan 1,4
+ switchport
+ switchport mode trunk
+ switchport trunk native vlan 1

Applying this on switches must be done with care: going from access mode to trunk mode will most likely trigger a port reset. If this cannot be avoided, apply the change one port (or one switch) at a time.
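
For reference, the equivalent manual change on a single port looks like the following (ethernet1/10 is a placeholder, use the actual islet worker port):

! ethernet1/10 is a placeholder, use the actual islet worker port
interface ethernet1/10
  switchport
  switchport mode trunk
  switchport trunk native vlan 1
  switchport trunk allowed vlan 1,4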

VLAN-aware bridges

Next, create a VLAN virtual interface on all islet hypervisors. This interface will use the already existing bridge (bradm) and be addressed on the new IP scheme.

# /etc/sysconfig/network-scripts/ifcfg-bradm.4
# Generated by Puppet
DEVICE=bradm.4
BOOTPROTO=static
ONBOOT=yes
IPADDR=10.1.XX.YY
NETMASK=255.255.255.0
MTU=9000
VLAN=yes

Do not forget aliased interfaces (for example, the data network), if any:

# /etc/sysconfig/network-scripts/ifcfg-bradm.4\:data
# Generated by Puppet
DEVICE=bradm.4:data
BOOTPROTO=static
ONBOOT=yes
IPADDR=10.5.XX.YY
NETMASK=255.255.255.0
MTU=9000
VLAN=yes

Add udev rules to make tagged traffic work on this bridge. They declare that the physical interface (enp59s0f1 in this case) and the bridge’s self interface can both handle VLAN-4 tagged traffic, which makes the bradm.4 interface work:

ACTION=="add", SUBSYSTEM=="net", KERNEL=="bradm.4", RUN+="/usr/sbin/bridge vlan add dev bradm vid 4 self"
ACTION=="add", SUBSYSTEM=="net", KERNEL=="bradm.4", RUN+="/usr/sbin/bridge vlan add dev enp59s0f1 vid 4"
ACTION=="remove", SUBSYSTEM=="net", KERNEL=="bradm.4", RUN+="/usr/sbin/bridge vlan del dev bradm vid 4 self"
ACTION=="remove", SUBSYSTEM=="net", KERNEL=="bradm.4", RUN+="/usr/sbin/bridge vlan del dev enp59s0f1 vid 4"

Some policy-based routes are also required, to make sure that traffic coming out of those IPs leaves through the right interface.

::::::::::::::
/etc/sysconfig/network-scripts/route-bradm.4
::::::::::::::
10.0.0.0/8 table 100 nexthop via 10.1.12.254 nexthop via 10.1.12.253
10.1.12.0/24 dev bradm.4 src 10.1.XX.YY table 100

::::::::::::::
/etc/sysconfig/network-scripts/route-bradm.4:data
::::::::::::::
10.5.0.0/24 table 100 nexthop via 10.5.12.254 nexthop via 10.5.12.253
10.5.12.0/24 dev bradm.4:data src 10.5.XX.YY table 100

::::::::::::::
/etc/sysconfig/network-scripts/rule-bradm.4
::::::::::::::
from 10.1.0.0/23 iif lo lookup main pref 1000
from 10.3.0.0/24 iif lo lookup main pref 1000
from 10.5.0.0/24 iif lo lookup main pref 1000
from 10.1.0.0/16 iif lo lookup 100 pref 1100
from 10.3.0.0/16 iif lo lookup 100 pref 1100
from 10.5.0.0/16 iif lo lookup 100 pref 1100

Do not forget to execute ifup bradm.4 after applying this configuration.
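
You can then check the tagged interface, the rules and the dedicated routing table:

# ip -d link show bradm.4
# ip rule show
# ip route show table 100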

VLAN-aware VM interfaces

A similar configuration has to be done on VM interfaces. Pcocc can handle this during the VM start procedure if the network is properly defined.

[...]
adm:
  type: bridged
  settings:
    host-bridge: bradm
    tap-prefix: admtap
    mtu: 9000
    vlans:
      - vid: 1
        type: native
      - vid: 4
        type: tagged
[...]

Running VMs can be live-patched with bridge commands and some xargs magic.

ls -d /sys/class/net/bradm/brif/*tap* | xargs -n 1 basename | xargs -i -n 1 bridge vlan add dev {} vid 4
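
Assuming the tap interfaces are named after the admtap prefix defined above (e.g. admtap0), the result can be checked on any of them:

# bridge vlan show dev admtap0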

The IP configuration of VM interfaces is very similar to the hypervisors’. Configure the interfaces:

# /etc/sysconfig/network-scripts/ifcfg-eth0.4
# Generated by Puppet
DEVICE=eth0.4
BOOTPROTO=static
ONBOOT=yes
IPADDR=10.1.42.3
NETMASK=255.255.255.0
MTU=9000
VLAN=yes

And, for the same reasons, policy-based routing:

# /etc/sysconfig/network-scripts/rule-eth0.4
from 10.1.0.0/23 iif lo lookup main pref 1000
from 10.3.0.0/24 iif lo lookup main pref 1000
from 10.5.0.0/24 iif lo lookup main pref 1000
from 10.1.0.0/16 iif lo lookup 100 pref 1100
from 10.3.0.0/16 iif lo lookup 100 pref 1100
from 10.5.0.0/16 iif lo lookup 100 pref 1100

# /etc/sysconfig/network-scripts/route-eth0.4
default table 100 nexthop via 10.1.42.254 nexthop via 10.1.42.253
10.1.42.0/24 dev eth0.4 src 10.1.42.3 table 100

VMs with VRRP-managed virtual IPs should be reconfigured too. This is not too difficult because keepalived can set multiple IPs within a single VRRP instance:

vrrp_instance puppetserver_islet_42_1_215 {
  [...]
  track_interface {
    eth0
    eth0.4
  }

  virtual_ipaddress {
    10.3.0.18/24 dev eth0
    10.3.42.1/24 dev eth0.4
  }
  [...]
}
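
After reloading keepalived (for instance with systemctl reload keepalived), the current VRRP master should carry both virtual addresses:

# systemctl reload keepalived
# ip -4 addr show dev eth0
# ip -4 addr show dev eth0.4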