Nexus 9000 Dynamic routing migration
====================================

This handbook explains the migration process from the legacy IP routing
architecture (as initially designed) to the *better* dynamic routing
architecture. We'll call the two Nexus 9336 the **top switches**, and the
Nexus 93180 pairs the **islet switches**.

Source & target design
----------------------

As initially discussed, the IP routing architecture was designed so that
there is a single routed hop between any two nodes in the Ethernet fabric.
This is seen as a performance optimization, but it introduces so much
configuration overhead that it gets in the way of most configuration changes
and makes debugging unintuitive.

Currently, **top switches** are present in all VLANs, while **islet
switches** carry a set of islet-specific VLANs plus VLAN 1, which is shared
with the **top switches**.

.. nwdiag::

   nwdiag {
     group { isletA }
     group { isletB }
     group { eswA }
     group { eswB }
     group { eswT }

     network top_vlans {
       address = "10.1.0.0/23"
       label = "Management VLANs"
       top
       isletA
       isletB
       eswT
       eswA
       eswB
     }
     network islet10_vlan {
       address = "10.X.10.0/24"
       label = "Islet 10 VLANs"
       eswA
       eswT
     }
     network islet20_vlan {
       address = "10.X.20.0/24"
       label = "Islet 20 VLANs"
       eswB
       eswT
     }
   }

Using this design, IP routing is asymmetric:

* Communications initiated from islets to tops are routed by islet switches
* Communications initiated from tops to islets are routed by top switches

The target design uses a new VLAN as a transit VLAN, and locally managed
subnets are announced on this VLAN using a dynamic routing protocol like BGP.

.. nwdiag::

   nwdiag {
     group { isletA }
     group { isletB }
     group { eswA eswB }
     group { eswT }

     network top_vlans {
       address = "10.1.0.0/23"
       label = "Management VLANs"
       top
       eswT
     }
     network ethernet_fabric {
       address = "169.254.0.0/16"
       label = "Transit VLAN"
       eswT
       eswA
       eswB
     }
     network islet10_vlan {
       address = "10.X.10.0/24"
       label = "Islet 10 VLANs"
       isletA
       eswA
     }
     network islet20_vlan {
       address = "10.X.20.0/24"
       label = "Islet 20 VLANs"
       isletB
       eswB
     }
   }

This is sub-optimal in terms of routing hops, but the configuration is much
simpler.

The legacy IP routing design has been taken into account in the IP allocation
scheme: because VLAN 1 is shared between top management nodes and islet
nodes, they sit in the same subnet. As such, one of the requirements is the
reallocation of all subnets that are shared between **top switches** VLANs
and **islet switches** VLANs.
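To visualise the extra routing hop mentioned above, here is a schematic
``traceroute`` from an islet node to a top management node, before and after
the migration. The addresses and annotations are purely illustrative, not
captured output:

.. code-block:: console

   # Legacy design: the islet switch SVI is the only routed hop
   $ traceroute -n 10.1.0.128
    1  10.X.10.254   (islet switch gateway)
    2  10.1.0.128    (destination)

   # Target design: the transit VLAN adds a hop through a top switch
   $ traceroute -n 10.1.0.128
    1  10.X.10.254   (islet switch gateway)
    2  169.254.0.1   (top switch, transit VLAN)
    3  10.1.0.128    (destination)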
Migration process
-----------------

Transit VLAN
^^^^^^^^^^^^

Allocate and deploy the transit VLAN and the associated IP allocation scheme.
This can be done with the following change in **confiture**'s configuration.

.. code-block:: diff

   commit 67da4312a4a97d698c33e252f8a34034670980ac
   Author: John Doe
   Date:   Tue Oct 6 16:57:45 2020 +0200

       Add VLAN 3 (Transit)

   diff --git a/hiera/addresses.yaml b/hiera/addresses.yaml
   index dcdcc6d..51c33dc 100644
   --- a/hiera/addresses.yaml
   +++ b/hiera/addresses.yaml
   @@ -944,6 +944,7 @@ addresses:
        eq: 10.0.0.[128-163] #/24
        adm: 10.1.0.[128-163] #/24
        eqs: 10.32.0.[128-163] #/24
   +    transit: 169.254.0.[1-36]
        critical: True

    ## Infiband top switches (manageable ones)

   diff --git a/hiera/switches.yaml b/hiera/switches.yaml
   index 044d891..aed1de6 100644
   --- a/hiera/switches.yaml
   +++ b/hiera/switches.yaml
   @@ -21,6 +21,9 @@ switches:
          2:
            addresses:
              - "${address('gw-common-equipment-core1')}/24"
   +      3:
   +        addresses:
   +          - "${address('esw1-transit')}/16"

Once this is deployed on the **top switches** and **islet switches**, they
can communicate on this VLAN.

.. code-block:: console

   # clush -bw esw[1,2,12-13,38-58/4,39-59/4] "ping 169.254.0.1"
   --------------- esw1 ---------------
   PING 169.254.0.1 (169.254.0.1): 56 data bytes
   64 bytes from 169.254.0.1: icmp_seq=0 ttl=255 time=0.357 ms
   64 bytes from 169.254.0.1: icmp_seq=1 ttl=255 time=0.298 ms
   64 bytes from 169.254.0.1: icmp_seq=2 ttl=255 time=0.249 ms
   64 bytes from 169.254.0.1: icmp_seq=3 ttl=255 time=0.403 ms
   64 bytes from 169.254.0.1: icmp_seq=4 ttl=255 time=0.306 ms

   --- 169.254.0.1 ping statistics ---
   5 packets transmitted, 5 packets received, 0.00% packet loss
   round-trip min/avg/max = 0.249/0.322/0.403 ms
   [...]

.. warning::

   If you're using ACLs, don't forget to add the corresponding rule. In this
   case::

      object-group ip address transit_subnet
        10 169.254.0.0/16
      ip access-list ROUTING_ACL
        400 permit ip addrgroup transit_subnet addrgroup transit_subnet vlan 3

BGP Sessions
^^^^^^^^^^^^

Create and start BGP sessions on **top switches** and **islet switches**.
The resulting configuration items on Nexus 9000 switches are the following.

On **top switches**::

   feature bgp

   route-map ROUTE_MAP permit 10
     match as-number 4294967294

   router bgp 4294967294
     address-family ipv4 unicast
       redistribute direct route-map ROUTE_MAP
       maximum-paths ibgp 10

   router bgp 4294967294
     neighbor 169.254.0.0/16
       update-source Vlan3
       remote-as 4294967294
       address-family ipv4 unicast
         route-reflector-client

On **islet switches**::

   feature bgp

   route-map ROUTE_MAP permit 10
     match as-number 4294967294

   router bgp 4294967294
     address-family ipv4 unicast
       redistribute direct route-map ROUTE_MAP
       maximum-paths ibgp 10

   router bgp 4294967294
     neighbor 169.254.0.1
       update-source Vlan3
       remote-as 4294967294
       address-family ipv4 unicast
         next-hop-self
     neighbor 169.254.0.2
       update-source Vlan3
       remote-as 4294967294
       address-family ipv4 unicast
         next-hop-self

This configuration creates iBGP peerings between **islet switches** and
**top switches**, with the **top switches** configured to redistribute
learned routes to the other clients (iBGP route reflection). Using this
configuration, any subnet directly connected to any of the switches is
reachable from the others.
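Assuming the peerings above are configured, session and route state can be
checked with the standard NX-OS show commands (output omitted here):

.. code-block:: console

   # All islet switch neighbors should be in "Established" state
   esw1# show bgp ipv4 unicast summary

   # Subnets redistributed by the other switches should show up as BGP routes
   esw1# show ip route bgp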
IP Scheme
^^^^^^^^^

Add and allocate the new IP scheme: islet management nodes (and switches) get
their own subnets (service, data and management subnets). For example, the
**islet** management node IP allocation has been updated as follows:

.. code-block:: diff

      adm: 10.1.0.[10-13,20-39,42-43,46-47,50-51,54-55,58-59] #/23
      data: 10.5.0.[10-13,20-39,42-43,46-47,50-51,54-55,58-59] #/24
      ipmi: 10.0.1.[10-13,20-39,42-43,46-47,50-51,54-55,60-61] #/23
   +  admbis: 10.1.[10-13/2,20-36/2,38-58/4].[9-10] #/24
   +  databis: 10.5.[10-13/2,20-36/2,38-58/4].[1-2] #/24
      critical: True

Make sure that all services hosted on islet nodes have an IP in these
subnets.

On **islet switches** (only), create a new VLAN for those subnets:

.. code-block:: yaml

   esw12:
     vpc:
       domain: "12"
       peer_addr: "${address('esw13-eq')}"
       peer: "esw13"
       source: "${address('esw12-eq')}"
     ports:
       mgmt:
         0:
           addresses:
             - "${address('esw12-eq')}/23"
       Vlan:
         1:
           addresses:
             - "${address('esw12-adm')}/23"
         3:
           addresses:
             - "${address('esw12-transit')}/16"
         4:
           addresses:
             - "${address('esw12-admbis')}/24"
             - "${address('gw-adm-i12main')}/24"
             - "${address('gw-data-i12main')}/24"
             - "${address('gw-svc-i12main')}/24"

Using the generated configurations, deploy this IP scheme on the **islet
switches**. This will create VLAN 4 and assign IPs to the related virtual
interfaces. If this is correctly applied, those IPs should be reachable from
the **top switches** through BGP announcements.

VLAN Interfaces
^^^^^^^^^^^^^^^

SVI Interfaces
""""""""""""""

First of all, allow islet workers to access the newly created VLAN:

.. code-block:: diff

   diff --git a/hiera/vlans.yaml b/hiera/vlans.yaml
   index 1116d01..36f2fea 100644
   --- a/hiera/vlans.yaml
   +++ b/hiera/vlans.yaml
   @@ -7,7 +7,7 @@ vlans:
        trunk: esw[1,2,10-13,20-39,42-43,46-47,50-51,54-55,58-59]
      4:
   -    trunk: esw[10-13,20-39,42-43,46-47,50-51,54-55,58-59]
   +    trunk: esw[10-13,20-39,42-43,46-47,50-51,54-55,58-59],islet[10-13,20-39,42-43,46-47,50-51,54-55,58-59]

This generates this kind of configuration modification:

.. code-block:: diff

   + switchport trunk allowed vlan 1,4
   + switchport
   + switchport mode trunk
   + switchport trunk native vlan 1

Applying this on switches must be done with care: moving a port from
``access`` mode to ``trunk`` mode will most likely trigger a port reset. If
this cannot be avoided, do it sequentially.

VLAN-aware bridges
""""""""""""""""""

Next, create a VLAN virtual interface on all islet hypervisors. This
interface will use the already existing bridge (``bradm``) and be addressed
on the new IP scheme.

.. code-block:: shell

   # /etc/sysconfig/network-scripts/ifcfg-bradm.4
   # Generated by Puppet
   DEVICE=bradm.4
   BOOTPROTO=static
   ONBOOT=yes
   IPADDR=10.1.XX.YY
   NETMASK=255.255.255.0
   MTU=9000
   VLAN=yes

Do not forget aliased interfaces (the ``data`` network for example), if any:

.. code-block:: shell

   # /etc/sysconfig/network-scripts/ifcfg-bradm.4:data
   # Generated by Puppet
   DEVICE=bradm.4:data
   BOOTPROTO=static
   ONBOOT=yes
   IPADDR=10.5.XX.YY
   NETMASK=255.255.255.0
   MTU=9000
   VLAN=yes

Add some ``udev`` rules to make tagged traffic work on this bridge. They
declare that both the physical interface (``enp59s0f1`` in this case) and
the bridge's self interface can handle VLAN-4 tagged traffic, which makes
the ``bradm.4`` interface work.

.. code-block:: shell

   ACTION=="add", SUBSYSTEM=="net", KERNEL=="bradm.4", RUN+="/usr/sbin/bridge vlan add dev bradm vid 4 self"
   ACTION=="add", SUBSYSTEM=="net", KERNEL=="bradm.4", RUN+="/usr/sbin/bridge vlan add dev enp59s0f1 vid 4"
   ACTION=="remove", SUBSYSTEM=="net", KERNEL=="bradm.4", RUN+="/usr/sbin/bridge vlan del dev bradm vid 4 self"
   ACTION=="remove", SUBSYSTEM=="net", KERNEL=="bradm.4", RUN+="/usr/sbin/bridge vlan del dev enp59s0f1 vid 4"
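Once the rules have fired (or after running the same ``bridge`` commands by
hand), VLAN membership can be verified; VID 4 should be listed on both the
physical interface and the bridge's self interface:

.. code-block:: shell

   # Check that VID 4 is allowed on the uplink and on the bridge itself
   bridge -d vlan show dev enp59s0f1
   bridge -d vlan show dev bradm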
Some policy-based routes are also required, to make sure that traffic coming
out of those IPs goes onto the right interface.

.. code-block:: shell

   ::::::::::::::
   /etc/sysconfig/network-scripts/route-bradm.4
   ::::::::::::::
   10.0.0.0/8 table 100 nexthop via 10.1.12.254 nexthop via 10.1.12.253
   10.1.12.0/24 dev bradm.4 src 10.1.XX.YY table 100

   ::::::::::::::
   /etc/sysconfig/network-scripts/route-bradm.4:data
   ::::::::::::::
   10.5.0.0/24 table 100 nexthop via 10.5.12.254 nexthop via 10.5.12.253
   10.5.12.0/24 dev bradm.4:data src 10.5.XX.YY table 100

   ::::::::::::::
   /etc/sysconfig/network-scripts/rule-bradm.4
   ::::::::::::::
   from 10.1.0.0/23 iif lo lookup main pref 1000
   from 10.3.0.0/24 iif lo lookup main pref 1000
   from 10.5.0.0/24 iif lo lookup main pref 1000
   from 10.1.0.0/16 iif lo lookup 100 pref 1100
   from 10.3.0.0/16 iif lo lookup 100 pref 1100
   from 10.5.0.0/16 iif lo lookup 100 pref 1100

Do not forget to execute ``ifup bradm.4`` after applying this configuration.

VLAN-aware VM interfaces
""""""""""""""""""""""""

A similar configuration has to be done on VM interfaces. Pcocc can handle
this during the VM start procedure if it is correctly defined:

.. code-block:: yaml

   [...]
   adm:
     type: bridged
     settings:
       host-bridge: bradm
       tap-prefix: admtap
       mtu: 9000
       vlans:
         - vid: 1
           type: native
         - vid: 4
           type: tagged
   [...]

Running VMs can be *live-patched* with ``bridge`` commands and some ``xargs``
magic:

.. code-block:: shell

   ls -d /sys/class/net/bradm/brif/*tap* | xargs -n 1 basename \
     | xargs -i -n 1 bridge vlan add dev {} vid 4

The IP configuration of VM interfaces is very similar to the hypervisors'.
Configure the interfaces:

.. code-block:: shell

   # /etc/sysconfig/network-scripts/ifcfg-eth0.4
   # Generated by Puppet
   DEVICE=eth0.4
   BOOTPROTO=static
   ONBOOT=yes
   IPADDR=10.1.42.3
   NETMASK=255.255.255.0
   MTU=9000
   VLAN=yes

And, for the same reasons, policy-based routing:

.. code-block:: shell

   # /etc/sysconfig/network-scripts/rule-eth0.4
   from 10.1.0.0/23 iif lo lookup main pref 1000
   from 10.3.0.0/24 iif lo lookup main pref 1000
   from 10.5.0.0/24 iif lo lookup main pref 1000
   from 10.1.0.0/16 iif lo lookup 100 pref 1100
   from 10.3.0.0/16 iif lo lookup 100 pref 1100
   from 10.5.0.0/16 iif lo lookup 100 pref 1100

   # /etc/sysconfig/network-scripts/route-eth0.4
   default table 100 nexthop via 10.1.42.254 nexthop via 10.1.42.253
   10.1.42.0/24 dev eth0.4 src 10.1.42.3 table 100

VMs with VRRP-managed virtual IPs should be reconfigured too. This is not too
difficult because ``keepalived`` can set multiple IPs with a single VRRP
instance::

   vrrp_instance puppetserver_islet_42_1_215 {
       [...]
       track_interface {
           eth0
           eth0.4
       }
       virtual_ipaddress {
           10.3.0.18/24 dev eth0
           10.3.42.1/24 dev eth0.4
       }
       [...]
   }
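As a final sanity check on a VM (a sketch, reusing the illustrative
``10.1.42.3`` address from above), verify that the tagged interface, the
policy rules and the dedicated routing table are all in place:

.. code-block:: shell

   # The tagged interface should carry its address
   ip -br addr show dev eth0.4

   # The policy rules and table 100 should be populated
   ip rule show
   ip route show table 100

   # Traffic sourced from the new subnet should leave through eth0.4
   ip route get 10.1.0.1 from 10.1.42.3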