VXLAN

VXLAN Packet Header

VXLAN Header
VXLAN Packet Structure
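
As a rough illustration of the header, the sketch below packs and unpacks the 8-byte VXLAN header laid out in RFC 7348: 8 flag bits with the I bit (0x08) set, 24 reserved bits, a 24-bit VNI, and 8 reserved bits. The function names are made up for this example, and the outer Ethernet/IP/UDP headers are left out.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid (RFC 7348)


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) followed by 24 reserved bits.
    # Word 2: VNI (24 bits) followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Return the VNI carried in an 8-byte VXLAN header."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("I flag not set; VNI is not valid")
    return word2 >> 8


header = pack_vxlan_header(11110)
print(header.hex())        # 08000000002b6600
print(unpack_vni(header))  # 11110
```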

VXLAN RT and RD

Route Distinguisher (the auto method)

  • Used in BGP table to keep all routes unique

  • 4-byte admin field and 2-byte numbering field, e.g. 12.12.12.12:32777 or 13.13.13.13:3

    • Admin field is BGP router ID

    • The numbering field is the internal VRF ID

      • The numbering field for L2/MAC addresses is 32767 + the VLAN ID (e.g. 32777 for VLAN 10)

      • The numbering field for L3/IP addresses starts at 3 (1 and 2 are reserved for the default and management VRFs)
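
The auto RD rules above can be sketched as two small helpers. This is only an illustration of the arithmetic; the function names are hypothetical, and VLAN 10 is inferred from the 32777 example (32777 - 32767).

```python
L2_RD_BASE = 32767  # offset added to the VLAN ID for L2/MAC RDs
FIRST_TENANT_VRF_ID = 3  # VRF IDs 1 and 2 are reserved (default, management)


def auto_rd_l2(router_id: str, vlan_id: int) -> str:
    """RD for a MAC-VRF: <BGP router ID>:<32767 + VLAN ID>."""
    return f"{router_id}:{L2_RD_BASE + vlan_id}"


def auto_rd_l3(router_id: str, vrf_id: int) -> str:
    """RD for an IP-VRF: <BGP router ID>:<internal VRF ID>."""
    if vrf_id < FIRST_TENANT_VRF_ID:
        raise ValueError("VRF IDs 1 and 2 are reserved")
    return f"{router_id}:{vrf_id}"


print(auto_rd_l2("12.12.12.12", 10))  # 12.12.12.12:32777
print(auto_rd_l3("13.13.13.13", 3))   # 13.13.13.13:3
```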

Route Target (the auto method)

  • Encoded in BGP extended community

  • Consists of a 2-byte admin field and a 4-byte numbering field, e.g. 2:100000

    • The admin field is the BGP ASN

    • The numbering field is the tenant VNI

  • For multi-AS environments, the auto-derived Route Targets differ per AS because they embed the local ASN, so the Route Target must either be statically defined or rewritten so that the ASN portion matches

  • Examples of an auto derived Route Target:

    • IP-VRF within ASN 65001 and L3VNI 50001 - Route Target 65001:50001

    • MAC-VRF within ASN 65001 and L2VNI 30001 - Route Target 65001:30001
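
A minimal sketch of the auto Route Target derivation, reproducing the two examples above. The helper name is hypothetical, and ASN 65002 is an assumed second AS used only to show why the auto method breaks down in multi-AS designs.

```python
def auto_route_target(asn: int, vni: int) -> str:
    """Auto-derived RT: <2-byte BGP ASN>:<VNI>."""
    if not 0 < asn < 2**16:
        raise ValueError("auto RT derivation assumes a 2-byte ASN")
    if not 0 < vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return f"{asn}:{vni}"


print(auto_route_target(65001, 50001))  # IP-VRF, L3VNI 50001 -> 65001:50001
print(auto_route_target(65001, 30001))  # MAC-VRF, L2VNI 30001 -> 65001:30001

# Same VNI in two different ASNs: the auto RTs do not match, so a leaf in
# one AS would not import the other's routes without a static or rewritten RT.
print(auto_route_target(65001, 30001) == auto_route_target(65002, 30001))  # False
```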

Configuration

VXLAN Prerequisites

  • Prerequisites are hardware/software specific

  • For Nexus 5600 as hardware VTEP

    • Set switching mode to store-and-forward and reboot: hardware ethernet store-and-fwd-switching

    • Establish IP unicast reachability between VTEPs

    • Establish PIM BIDIR reachability between VTEPs

      • Spines can be phantom RPs for redundancy

    • Enable features:

      • feature vn-segment-vlan-based

      • feature nv overlay

Flood and Learn

  • Map VLAN to VXLAN: vn-segment under vlan config mode

  • Create Network Virtualization Edge (NVE) interface: interface nve

  • Specify VTEP source: source-interface loopback0

  • Specify VNI membership: member vni [vnid]

  • Specify multicast group for BUM replication: mcast-group [group]

  • Multicast group 228.9.10.11 in this example must be the same on all VTEPs

  • VNID 11110 in this example must also match on all VTEPs that share the segment; only the VLAN ID is locally significant on each VTEP
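
The flood-and-learn steps above can be tied together with a small template renderer. This is a sketch, not a verified configuration: the function name and VLAN 10 are hypothetical, the NVE interface number 1 is an assumption, and the exact command syntax varies by platform and release, so check the rendered lines against the platform documentation before use.

```python
def flood_and_learn_config(vlan_id: int, vni: int, mcast_group: str,
                           source_intf: str = "loopback0") -> str:
    """Render the VLAN-to-VNI mapping and NVE interface as config text."""
    lines = [
        f"vlan {vlan_id}",
        f"  vn-segment {vni}",             # map VLAN to VXLAN
        "interface nve1",                  # NVE interface (number assumed)
        "  no shutdown",
        f"  source-interface {source_intf}",  # VTEP source
        f"  member vni {vni}",             # VNI membership
        f"    mcast-group {mcast_group}",  # BUM replication group
    ]
    return "\n".join(lines)


# VLAN 10 is a made-up value; the VNID and group come from the notes above.
print(flood_and_learn_config(vlan_id=10, vni=11110, mcast_group="228.9.10.11"))
```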

BGP EVPN

  • Map VLAN to VXLAN: vn-segment under vlan config mode

  • Create Network Virtualization Edge (NVE) interface: interface nve

  • Specify VTEP source: source-interface loopback0

  • Specify VNI membership: member vni [vnid]

  • Specify multicast group for BUM replication: mcast-group [group]

  • Specify BGP as control plane protocol: host-reachability protocol bgp

  • Establish BGP EVPN peerings

    • address-family l2vpn evpn

    • extended communities required

  • Generate BGP advertisement (like network statement)

    • evpn

      • vni [vnid] l2

        • rd auto

        • route-target import auto

        • route-target export auto

Operation Steps

  • Map VLANs to VXLAN Network Identifiers (VNIs/VNIDs)

  • Advertise information into BGP

    • MAC to L2 VNI to VTEP mapping

    • IP to L3 VNI to VTEP mapping

  • Import MAC addresses into the CAM table for bridging

  • Route traffic through SVIs to remote segments

Spine configuration

  • Spine is Route Reflector

Leaf configuration

Verification

  • show interface nve id

  • show platform fwm info nve peer|vni [all]

  • show mac address-table [vlan id]

  • show nve peer|vni

  • show bgp l2vpn evpn [summary]

  • show bgp l2vpn evpn neighbor $address advertised-routes

  • show ip mroute 228.9.10.11

  • show nve vni

  • show nve peers

  • show l2route evpn mac all

  • show l2route evpn mac-ip all

  • show nve internal bgp rnh database (rnh: recursive next hop)

  • show system internal l2rib event mac

  • show fabric forwarding internal event-history events

  • show fabric forwarding ip local-host-db vrf $VRF

Inter-VLAN Routing - Asymmetric vs Symmetric IRB

  • EVPN Integrated Routing and Bridging (IRB) has two options:

    • Asymmetric IRB

    • Symmetric IRB

  • Asymmetric IRB

    • Ingress VTEP (Leaf) does both L2 and L3 lookup

    • Egress VTEP does L2 lookup only

    • I.e. Bridge - Route - Bridge. An SVI for every segment must be configured on every VTEP, because it is needed for both forward and return traffic. This is inefficient: it increases ARP cache and CAM table size and causes control plane scaling issues

  • Symmetric IRB

    • Ingress VTEP does both L2 and L3 lookup

    • Egress VTEP does both L3 and L2 Lookup

    • I.e. Bridge - Route - Route - Bridge

How Symmetric IRB works

  • New concept called Layer-3 VNI

  • Each tenant VRF is mapped to a unique Layer-3 VNI

    • Mapping must match on all VTEPs

  • All routed VXLAN traffic is encapsulated with the L3 VNI in the VXLAN header, which allows a single VNI per tenant VRF to be shared among all VTEPs

  • L2 VNIs only need to be configured where access ports exist -> saves ARP and CAM table space
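
A rough sketch of how an ingress VTEP picks the VNI under symmetric IRB: bridged traffic keeps the L2 VNI of its VLAN, while routed traffic is encapsulated with the L3 VNI of the tenant VRF. The VLAN IDs, the VRF name TENANT-A, the second L2 VNI, and the function name are all hypothetical.

```python
L2_VNI_BY_VLAN = {10: 30001, 20: 30002}  # hypothetical local VLAN-to-L2VNI map
L3_VNI_BY_VRF = {"TENANT-A": 50001}      # hypothetical VRF-to-L3VNI map


def encap_vni(src_vlan: int, dst_vlan: int, vrf: str) -> int:
    """Return the VNI placed in the VXLAN header by the ingress VTEP."""
    if src_vlan == dst_vlan:
        # Same segment: bridge, so use the L2 VNI mapped to the VLAN.
        return L2_VNI_BY_VLAN[src_vlan]
    # Different segment: route, so use the L3 VNI of the tenant VRF.
    # The destination VLAN does not have to exist on this VTEP.
    return L3_VNI_BY_VRF[vrf]


print(encap_vni(10, 10, "TENANT-A"))  # 30001 (bridged)
print(encap_vni(10, 30, "TENANT-A"))  # 50001 (routed; VLAN 30 is not local)
```

Because the routed case never looks up the destination VLAN locally, the ingress VTEP needs no SVI or L2 VNI for segments it has no access ports in.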

Configuration

vPC and VXLAN BGP

  • VXLAN Traffic is tunneled over the underlay network using the BGP next-hop address of the remote VTEP

    • NVE source interface (i.e. loopback0) is the default BGP next-hop for advertised routes

    • In a vPC, both vPC peers advertise duplicate EVPN MAC/IP routes to spine RRs

    • With other attributes equal, next-hop is tie breaker in BGP Best Path Selection

    • Implies that one vPC peer is always preferred for dual attached hosts

    • Result is that egress traffic from vPC member is load balanced, but return ingress traffic is polarized -> Solution is to use Anycast VTEP address

  • vPC peers share duplicate IP address on NVE source interface

    • Peer 1 - interface loopback0; ip address 1.1.1.51/32

    • Peer 2 - interface loopback0; ip address 1.1.1.52/32

    • Both peers - interface loopback0; ip address 1.1.1.111/32 secondary

  • BGP next-hop is automatically set to the secondary address for locally originated routes, i.e. L2VPN EVPN MAC/IP routes for vPC member ports. This can be changed to the primary IP address with additional configuration

  • Result is that ingress flows from spines are load balanced. Other leafs use IGP ECMP to reach shared secondary ip address

  • On Nexus 5600, all traffic across the vPC peer link must be VXLAN encapsulated due to ASIC implementation

  • Normal vPC Peer link is a classical ethernet trunk

    • Result is that East/West flows over vPC Peer link are broken by default

    • i.e. the VNI number is lost when packet is sent out the peer link

  • Peer link is normally only used for orphans or in failure scenarios

    • Result is that everything looks fine until the failure occurs

    • Traffic to orphans & single attached members black-holed over vPC peer link

  • Workaround is to maintain VXLAN encapsulation across Peer Link: vpc nve peer-link-vlan

    • Create new VLAN and specify as NVE Peer Link VLAN

      • vlan 999

      • vpc nve peer-link-vlan 999

    • Establish layer 3 peering across NVE Peer Link VLAN

      • interface vlan 999

      • ip router ospf 1 area 0

      • ip router isis 1

    • Traffic engineer so other vPC Peer's VTEP loopback is preferred over vPC Peer Link

      • ip ospf cost 10

      • isis metric 10 level-2

VXLAN Underlay Fabric Convergence

  • VXLAN underlay fabric convergence is based on three factors which must be addressed separately to achieve High Availability for VXLAN overlay flows

    • IGP convergence

    • PIM convergence

    • BGP convergence

  • Four factors generally affect IGP convergence time

    • Failure detection time: is the neighbor down?

      • Link up/down event

      • Routing protocol hello/dead timers

      • IP SLA & EEM

      • Bidirectional Forwarding Detection (BFD)

    • Event Propagation Time: tell neighbors about the change

      • EIGRP Query/Reply

      • OSPF LSA Flooding Procedure

      • BGP Update/Withdraw

    • Recalculation time: Run SPF/DUAL/etc. calculation

      • EIGRP DUAL

      • OSPF SPF

      • BGP Best Path Selection

    • Forwarding Table update Time: Install new paths

      • EIGRP topology to RIB download

      • RIB to Software FIB Download

      • Software FIB to hardware TCAM download
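
The four factors add up to the total convergence time, which a quick calculation makes concrete. The sketch below assumes the usual BFD rule that detection time is the transmit interval multiplied by the detect multiplier, and compares it with the default 40-second OSPF dead timer on broadcast links. The propagation, recalculation, and table-update figures are made-up placeholders, not measurements.

```python
def bfd_detection_ms(tx_interval_ms: int, multiplier: int) -> int:
    """BFD declares the neighbor down after `multiplier` missed packets."""
    return tx_interval_ms * multiplier


def total_convergence_ms(detection: int, propagation: int,
                         recalculation: int, table_update: int) -> int:
    """Sum of the four convergence factors, in milliseconds."""
    return detection + propagation + recalculation + table_update


# Placeholder values (ms) for the last three factors.
PROPAGATION, RECALCULATION, TABLE_UPDATE = 50, 100, 200

with_dead_timer = total_convergence_ms(40_000, PROPAGATION, RECALCULATION, TABLE_UPDATE)
with_bfd = total_convergence_ms(bfd_detection_ms(250, 3), PROPAGATION, RECALCULATION, TABLE_UPDATE)

print(with_dead_timer)  # 40350
print(with_bfd)         # 1100
```

With these placeholder numbers, failure detection dominates the total, which is why detection is usually the first factor to tune.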

Methods of Modifying Convergence Time

  • Reactive optimizations

    • e.g. carrier delay and link debounce timer

    • e.g. Fast Hellos & BFD

    • e.g. OSPF LSA & SPF pacing

    • e.g. FIB prefix prioritization

  • Proactive optimizations

    • EIGRP Feasible successors

    • OSPF Loop Free Alternate (LFA)

    • BGP Prefix Independent Convergence (PIC)

    • MPLS Traffic Engineering Fast Reroute (TE FRR)

BFD

Verification

  • show bfd neighbor

PIM Convergence

  • Generally two factors affect PIM convergence time

    • Neighbor Failure Detection Time: Is the PIM neighbor down

    • RP Failure Detection Time

      • Can I still join the (*,G)? - ASM and BIDIR

      • Can I still register the (S,G)? - ASM only

  • PIM RP, like a BGP RR, adds High Availability by adding Node Redundancy: RP should never be a single point of failure

  • Redundancy Design depends on PIM design

    • Any Source Multicast (ASM)

      • Auto-RP with multiple candidate mapping agents and RPs (slow convergence)

      • BSR with multiple BSR and RP candidates (slow convergence)

      • Anycast RP

    • Bidirectional RP (BiDir)

      • Phantom RP

    • Source Specific Multicast (SSM)

      • No RPs used, no redundancy needed

Anycast RP

  • Adds redundancy by sharing RP address between multiple nodes: e.g. duplicate loopback1 address advertised into IGP

  • Multicast control plane must sync between anycast RPs

    • Multicast Source Discovery Protocol (MSDP)

    • PIM anycast

Verification

  • debug ip pim data-register send

  • debug ip pim data-register receive

  • debug ip pim null-register

Phantom RP

  • BiDir PIM does not use Register or (S,G) join

    • only multicast state is (*,G) rooted at the RP

    • Implies that anycast isn't needed to sync state

  • Phantom RP provides redundancy based on longest match routing

    • Primary RP advertises longest match into IGP

    • Secondary RP advertises next longest match into IGP

    • Primary RP fails, secondary RP's address now becomes longest match

Verification

  • show ip pim rp

External Routing

  • By default, hosts in the VXLAN fabric are isolated from the rest of the network

    • E.g. underlay fabric is in "default" VRF, while servers are in the tenant VRFs

  • Assumption is that tenant applications need external reachability

    • E.g. web clients on internet need access to web server in VXLAN fabric

  • Border leafs are used to connect external networks to internal fabric

  • Border leafs run multiple copies of the routing control plane

    • MP-BGP L2VPN EVPN to VTEPS inside VXLAN fabric

    • Tenant VRF aware IPv4/IPv6 Unicast BGP or IGP to external router(s)

    • MP-BGP to BGP/IGP redistribution occurs on Border Leaf

  • External router(s) can have Tenant VRFs

    • Allows for overlapping addressing inside Tenant networks, e.g. VRF-Lite

  • External router(s) don't require Tenant VRFs

    • can mix all routes into default routing table as long as addresses are unique

  • Border leafs maintain all host routes for all Tenant VRFs

    • e.g. they must import all prefixes into VRFs from MP-BGP

  • External Routers don't need host routes, just aggregates

    • Summarization should occur at the MP-BGP to BGP/IGP redistribution point

    • Route leaking could be used for longer match routing traffic engineering

Border Leaf Configuration

External Router Configuration

Verification

Border Leaf

  • show bgp vrf SHARED ipv4 unicast summary

VXLAN EVPN Multisite

Reference

  • VXLAN Network with MP-BGP EVPN Control Plane Design Guide: https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/guide-c07-734107.html

  • Configuration and Verification VXLAN with MP-BGP EVPN Control Plane: https://www.cisco.com/c/en/us/support/docs/ip/border-gateway-protocol-bgp/200952-Configuration-and-Verification-VXLAN-wit.html

  • VXLAN EVPN Multi-Site Design and Deployment White Paper: https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/white-paper-c11-739942.html

  • https://chasewright.com/vxlan-evpn-multisite-setup-part-1/

  • Cisco NX-OS/IOS Multicast Comparison: https://docwiki.advanxer.com/docwiki.cisco.com/wiki/Cisco_NX-OS/IOS_Multicast_Comparison.html

  • NextGen DCI with VXLAN EVPN Multi-Site Using vPC Border Gateways White Paper: https://www.cisco.com/c/en/us/products/collateral/switches/nexus-9000-series-switches/whitepaper-c11-742114.html

  • Cisco Programmable Fabric with VXLAN BGP EVPN Configuration Guide: https://www.cisco.com/c/en/us/td/docs/switches/datacenter/pf/configuration/guide/b-pf-configuration/Introducing-Cisco-Programmable-Fabric-VXLAN-EVPN.html

  • Deploying a Data Center: http://dc.ciscolive.com/pod0/labs/lab1/lab1

  • Cisco Nexus 9000 Series NX-OS VXLAN Configuration Guide, Release 10.2(x): https://www.cisco.com/c/en/us/td/docs/dcn/nx-os/nexus9000/102x/configuration/vxlan/cisco-nexus-9000-series-nx-os-vxlan-configuration-guide-release-102x/m_configuring_vxlan_93x.html

  • VXLAN: https://www.youtube.com/playlist?list=PLDQaRcbiSnqFe6pyaSy-Hwj8XRFPgZ5h8

  • VXLAN Overview - Part 1: https://networkdirection.net/articles/routingandswitching/vxlanoverview/

  • VXLAN BGP EVPN Configuration - Part 6: https://networkdirection.net/articles/routingandswitching/vxlanoverview/vxlanevpnconfiguration/

  • Troubleshooting duplicate IP/MAC in MP-BGP EVPN VxLan on Nexus 9000: https://www.ciscolive.com/c/dam/r/ciscolive/us/docs/2019/pdf/CTHDCN-2304.pdf

  • VXLAN EVPN Multisite Implementation: https://www.youtube.com/watch?v=vMj-aGFjAKM&list=PLpGt4hh32rCrevUCtFL0N2FrTNpCNfIiC&index=14

  • VXLAN: https://rayka-co.com/course/vxlan-evpn/

  • VXLAN EVPN Multisite: https://www.youtube.com/watch?v=vJqwIl2V8GY

  • VXLAN Primer Series: https://www.youtube.com/watch?v=bSiriF8kM7E&list=PLxyr0C_3Ton2-AsrD2iMdQ1mV4bqae8kv&index=1

  • VXLAN Multisite: https://www.youtube.com/watch?v=KFW16GRFMz8&list=PLYzE2pIn57rHznu5eRUT88F5H9LAzrVF5&index=8

  • VXLAN BGP EVPN Multisite: https://www.youtube.com/watch?v=y-ZDCMwEpxw

  • https://datacenteroverlords.com/2022/12/13/in-defense-of-ospf-in-the-underlay-in-some-situations/

  • BGP EVPN Step by Step Configuration Example: https://blog.devopssimplified.com/BGP-EVPN-Step-by-Step
