VXLAN Underlay Design: BI-DIR PIM, TRM, and Where Ingress Replication Falls Short
If you're designing a VXLAN EVPN fabric, one of the first decisions you'll make is how to handle BUM flooding across the underlay. BUM stands for Broadcast, Unknown unicast, and Multicast. Unknown unicast is what happens when a VXLAN Tunnel Endpoint (VTEP) receives a frame destined for a MAC address that isn't in its table: it can't make a forwarding decision, so it floods the frame to every VTEP in that broadcast domain and relies on the destination, if it exists, to respond. That response teaches the fabric where the host lives, and future frames get forwarded directly as known unicast. BUM flooding is the mechanism that keeps the fabric from dropping traffic while the control plane catches up.
Most guides jump straight to "use Ingress Replication, it's simpler," and honestly, for a lot of deployments that's fine. But if you don't understand why the multicast underlay exists and what TRM adds on top of it, you'll hit a wall the moment a tenant needs L3 multicast. This post walks through BI-DIR PIM, how TRM sits on top of it, and where Ingress Replication actually breaks down.
The Underlay's Job
The VXLAN underlay doesn't know or care about tenant VRFs, VNIs, or what's inside those encapsulated frames. Its job is to get encapsulated VXLAN packets from one VTEP to another. For known unicast that's straightforward: the ingress VTEP only needs an underlay route to the VTEP where the destination host lives. The complexity comes from BUM traffic, which by definition needs to reach every VTEP participating in a given VNI.
You have two choices for how to handle that replication:
- Let the underlay do it via IP multicast (PIM)
- Make the ingress VTEP do it via individual unicast copies (Ingress Replication)
Both solve BUM flooding. They diverge significantly in how they scale and what additional use cases they support.
BI-DIR PIM: How It Actually Works
BI-DIR PIM builds a single shared tree rooted at the Rendezvous Point (RP). Unlike PIM-SM, there's no per-source Shortest Path Tree and no SPT switchover. All traffic flows bidirectionally along the shared tree. This makes it well-suited to the many-to-many replication pattern you get in a VXLAN fabric, where any VTEP can be both a source and a receiver. It also means routers only keep (*, G) state: no router has to track a shortest path back to each individual source, so there are no per-source (S, G) entries.
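To put rough numbers on the state savings, here is a minimal sketch. The function name and the counting model are assumptions made for illustration: it treats the PIM-SM case as a worst-case upper bound in which a router sits on every source tree, which real routers off those paths will not reach.

```python
def mroute_entries(groups: int, sources_per_group: int, bidir: bool) -> int:
    """Rough upper bound on multicast routing entries held by one router.

    Hypothetical model: one (*, G) entry per group in either mode, plus one
    (S, G) entry per source per group when per-source trees are built.
    """
    if groups < 0 or sources_per_group < 0:
        raise ValueError("counts must be non-negative")
    shared_tree_entries = groups  # (*, G), needed in both modes
    if bidir:
        return shared_tree_entries  # BI-DIR never creates (S, G) state
    return shared_tree_entries + groups * sources_per_group


# 100 underlay groups, 50 VTEPs that can each act as a source for every group
print(mroute_entries(100, 50, bidir=True))
print(mroute_entries(100, 50, bidir=False))
```

In a VXLAN fabric every VTEP is a potential source for every group it joins, which is why the per-source term grows so quickly.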
The Designated Forwarder Election
The piece that often trips people up is the Designated Forwarder (DF). It exists because, on a shared bidirectional tree, multiple routers on the same segment could each forward the same packet toward the RP over different links, creating duplicates. The DF election prevents this by designating exactly one router per link per group range as the sole upstream forwarder on that link.
The election is metric-based: each router advertises its cost to reach the RP to its neighbors on the segment, and the router with the best (lowest-cost) route to the RP wins. The DF is the only router allowed to forward traffic upstream toward the RP on that interface. If that sounds familiar, it should: it's the same principle as a Designated Port elected toward the Root Bridge in Spanning Tree. Per RFC 5015, the DF is also the router that forwards downstream traffic from the tree onto that link, so receivers on the segment get exactly one copy.
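The comparison at the heart of the election can be sketched in a few lines. This is a simplified model, not the protocol's message exchange: the `Offer` class and `elect_df` function are hypothetical names, and the tie-break (higher interface address wins when preference and metric are equal) is modeled on my reading of RFC 5015.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Offer:
    """What one router advertises on a link during the DF election (simplified)."""
    router: str
    address: IPv4Address  # router's interface address on the shared link
    preference: int       # route preference (administrative distance) toward the RP
    metric: int           # unicast routing metric toward the RP


def elect_df(offers: list[Offer]) -> Offer:
    """Pick the DF for one link: best route to the RP wins.

    Lower preference beats higher, then lower metric beats higher.
    Assumed tie-break: the higher interface address wins.
    """
    if not offers:
        raise ValueError("no routers on the link")
    return min(offers, key=lambda o: (o.preference, o.metric, -int(o.address)))


offers = [
    Offer("spine1", IPv4Address("10.0.0.1"), preference=110, metric=40),
    Offer("spine2", IPv4Address("10.0.0.2"), preference=110, metric=40),
    Offer("leaf1", IPv4Address("10.0.0.3"), preference=110, metric=80),
]
print(elect_df(offers).router)
```

The real election also involves timers and explicit Offer, Winner, Backoff, and Pass messages; the sketch only captures who wins once everyone's metrics are known.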
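Spine nodes typically have equal-cost paths to the RP, which is often hosted on the spines themselves. Note that a bidirectional shared tree needs a single root, so RP redundancy for BI-DIR is usually built with a phantom RP rather than the anycast RP you'd use with PIM-SM. Leaf nodes have the spines as their upstream. One spine will be elected DF per segment, preventing the leaf from receiving duplicate copies of a flooded frame.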
VNI-to-Multicast Group Mapping
Each VNI maps to an underlay multicast group. When a VTEP needs to flood a BUM frame for VNI 10100, it encapsulates it and sends it to the corresponding group address. BI-DIR PIM replicates it to every other VTEP that has joined that group.
On NX-OS:
! BUM traffic for each VNI is flooded to its own underlay group
interface nve1
  member vni 10100
    mcast-group 239.1.1.100
  member vni 10200
    mcast-group 239.1.1.200
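This works, but it doesn't scale elegantly. With a one-to-one mapping, hundreds of VNIs means hundreds of multicast groups and hundreds of entries in the PIM state tables of every router in the underlay. You can map several VNIs to the same group to cut that state, but then VTEPs receive floods for VNIs they don't host. Either way, every VNI you add is operational overhead, because the mapping has to be configured consistently on every VTEP in the fabric.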
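One way to keep the mapping consistent is to generate it instead of typing it. The sketch below is an assumption-heavy illustration, not a recommended allocation scheme: `assign_groups` and `render_nve` are hypothetical helpers, the `239.1.1.0/24` pool is just an example range, and the sequential assignment (wrapping around so VNIs share groups once the pool runs out) is only one possible policy.

```python
from ipaddress import IPv4Network
from itertools import cycle


def assign_groups(vnis, pool="239.1.1.0/24"):
    """Map each VNI to an underlay group, handing out addresses in order.

    Hypothetical policy: VNIs are sorted so every VTEP derives the same
    mapping, and the pool wraps around when there are more VNIs than groups.
    """
    groups = cycle(IPv4Network(pool).hosts())
    return {vni: str(next(groups)) for vni in sorted(vnis)}


def render_nve(mapping):
    """Render the mapping as an NX-OS style NVE interface stanza."""
    lines = ["interface nve1"]
    for vni, group in mapping.items():
        lines.append(f"  member vni {vni}")
        lines.append(f"    mcast-group {group}")
    return "\n".join(lines)


print(render_nve(assign_groups([10200, 10100])))
```

Because the mapping is derived from the sorted VNI list alone, running the same script for every VTEP produces the same group for a given VNI, which is the property that matters.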
TRM: Where BI-DIR PIM Becomes Essential
Ingress Replication handles BUM flooding. That's it. It has no mechanism to efficiently handle actual L3 multicast flows within tenant VRFs. If a source in VRF Customer-A sends to 239.10.10.10, IR can't selectively replicate that to only the VTEPs with interested receivers. You'd end up flooding it everywhere or dropping it.
TRM (Tenant Routed Multicast) solves this by extending the EVPN control plane to handle L3 multicast natively, with the underlay PIM tree doing the actual replication work.
What TRM Adds
VRF-aware RP placement. The RP is configured per tenant VRF. In a VXLAN fabric the RP is typically hosted on a VTEP itself, usually the spines, as an anycast RP shared across multiple nodes for redundancy. Every VTEP learns reachability to the RP address for each VRF via BGP EVPN Route Type 5 (IP prefix) routes.
BGP EVPN Route Type 6. This is the overlay control plane equivalent of an IGMP join. When a receiver behind VTEP-B sends an IGMP join for group 239.10.10.10 in VRF Customer-A, VTEP-B generates an RT-6 (Selective Multicast Ethernet Tag route) and advertises it to all other VTEPs via BGP. The ingress VTEP now knows exactly which remote VTEPs have interested receivers. No flooding required.
Selective replication. Instead of flooding to every VTEP in the fabric, the ingress VTEP sends multicast traffic only to VTEPs that have signaled interest via RT-6. The underlay BI-DIR PIM tree handles the actual replication of the encapsulated VXLAN packets between those VTEPs.
The Packet Walk
- Receiver behind VTEP-B sends IGMP join for 239.10.10.10 in VRF Customer-A
- VTEP-B advertises BGP EVPN RT-6: "I have a receiver for this group in this VRF"
- Source behind VTEP-A sends multicast to 239.10.10.10
- VTEP-A checks its EVPN table, sees VTEP-B has an interested receiver, encapsulates the packet in VXLAN using the L3 VNI for VRF Customer-A
- VTEP-A sends the encapsulated packet to the underlay multicast group. This is the handoff point between overlay and underlay.
- BI-DIR PIM replicates the packet through the shared tree to all VTEPs that have joined that underlay group
- VTEP-B receives it, strips the VXLAN header, and forwards the inner multicast packet into VRF Customer-A to the waiting receiver
The VTEP is simultaneously a BGP EVPN speaker in the overlay and a PIM router in the underlay. The NVE interface is where both worlds meet on every packet.
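The control-plane side of that walk, where joins are advertised and the ingress VTEP looks up who is interested, can be modeled as a small table. This is a toy sketch of the bookkeeping only: `OverlayJoinTable` and its methods are hypothetical names, and nothing here speaks BGP or PIM.

```python
from collections import defaultdict


class OverlayJoinTable:
    """Toy model of the per-VRF receiver interest a VTEP learns via BGP."""

    def __init__(self):
        # (vrf, group) -> set of VTEPs that advertised a join
        self._receivers = defaultdict(set)

    def advertise_join(self, vtep, vrf, group):
        """A VTEP saw a local IGMP join and advertised it to its peers."""
        self._receivers[(vrf, group)].add(vtep)

    def withdraw_join(self, vtep, vrf, group):
        """The last local receiver left, so the VTEP withdraws its route."""
        self._receivers[(vrf, group)].discard(vtep)

    def interested_vteps(self, ingress, vrf, group):
        """Remote VTEPs that should receive traffic for this VRF and group."""
        return self._receivers.get((vrf, group), set()) - {ingress}


table = OverlayJoinTable()
table.advertise_join("VTEP-B", "Customer-A", "239.10.10.10")
print(table.interested_vteps("VTEP-A", "Customer-A", "239.10.10.10"))
```

The key is the pair of VRF and group, not the group alone: the same group address in a different tenant VRF is a different flow with a different receiver set.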
Ingress Replication: The Real Downsides
With the multicast model clear, here's where IR actually falls short:
Ingress VTEP bears all replication cost. With PIM, the ingress VTEP sends one copy and the tree handles the rest. Most of that replication happens on the spines, which usually host the RP and are typically beefier boxes than your leaf switches. With IR, the ingress VTEP sends a separate unicast copy to every other VTEP that participates in the VNI. If all 50 VTEPs in a fabric carry that VNI, one BUM frame becomes 49 unicast packets sourced from a single VTEP. Under heavy BUM load this burns real uplink bandwidth and replication capacity on the ingress node.
Linear scaling problem. The replication burden grows directly with VTEP count. A 20-VTEP fabric handles it fine. A 200-VTEP fabric does not. With a multicast tree, replication is distributed across the network, so the ingress VTEP's workload stays constant regardless of scale.
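The arithmetic is simple enough to write down. In this sketch the function names and the 10 Mbit/s BUM rate are assumptions chosen for illustration; the only facts taken from the text are that IR sends one copy per remote VTEP and the multicast model sends a single copy.

```python
def ingress_copies(vteps_in_vni: int, mode: str) -> int:
    """Packets the ingress VTEP must send for one BUM frame."""
    if vteps_in_vni < 1:
        raise ValueError("a VNI needs at least one VTEP")
    if mode == "ir":
        return vteps_in_vni - 1  # one unicast copy per remote VTEP
    if mode == "multicast":
        return 1 if vteps_in_vni > 1 else 0  # the tree replicates the rest
    raise ValueError(f"unknown mode: {mode}")


def ingress_uplink_bps(bum_bps: int, vteps_in_vni: int, mode: str) -> int:
    """Uplink bandwidth the ingress VTEP spends on a given BUM offered load."""
    return bum_bps * ingress_copies(vteps_in_vni, mode)


for size in (20, 50, 200):
    ir = ingress_uplink_bps(10_000_000, size, "ir")
    mc = ingress_uplink_bps(10_000_000, size, "multicast")
    print(f"{size} VTEPs: IR {ir / 1e6:.0f} Mbit/s, multicast {mc / 1e6:.0f} Mbit/s")
```

At an assumed 10 Mbit/s of BUM traffic, the multicast column stays flat while the IR column approaches 2 Gbit/s at 200 VTEPs, all of it leaving a single leaf.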
BUM traffic doesn't disappear with EVPN. EVPN suppresses a lot of flooding by pre-populating MAC and IP bindings across all VTEPs via BGP before any traffic flows. So by the time a frame arrives, the destination MAC is usually already known and the unknown unicast flood never happens. That's a win for EVPN as the control plane, but you still get ARP for IPs not yet in the EVPN table, genuine broadcast, and tenant multicast that isn't suppressed. All of that still needs flooding, and with IR it all lands on the ingress VTEP.
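A simplified view of the ingress decision shows which frames still end up on the flood path. The `classify` function and its return strings are hypothetical, and the model deliberately ignores features such as ARP suppression that can answer some broadcasts locally.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


def classify(dst_mac: str, evpn_macs: dict[str, str]) -> str:
    """Decide how an ingress VTEP treats a frame, given MACs learned via EVPN.

    evpn_macs maps a destination MAC to the remote VTEP that advertised it.
    """
    mac = dst_mac.lower()
    if mac == BROADCAST:
        return "flood: broadcast"
    # A set low-order bit in the first octet marks a group (multicast) address
    if int(mac.split(":")[0], 16) & 1:
        return "flood: multicast"
    if mac in evpn_macs:
        return f"unicast to {evpn_macs[mac]}"
    return "flood: unknown unicast"


learned = {"00:50:56:aa:bb:01": "VTEP-B"}
for dst in ("00:50:56:aa:bb:01", "00:50:56:aa:bb:02", BROADCAST, "01:00:5e:0a:0a:0a"):
    print(dst, "->", classify(dst, learned))
```

EVPN shrinks the last branch by filling the table ahead of time, but the first two branches are untouched by it.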
No L3 tenant multicast. The fundamental one. IR handles flooding only. Any tenant application running multicast (anything using multicast group addresses at L3) has no efficient path with IR. You need TRM, which needs a multicast underlay.
Config Comparison
Multicast underlay:
interface nve1
  member vni 10100
    mcast-group 239.1.1.100
Ingress Replication (BGP auto-discovers remote VTEPs):
interface nve1
  member vni 10100
    ingress-replication protocol bgp
IR is simpler to configure and operate: no multicast group planning, no PIM to troubleshoot, and BGP handles VTEP discovery automatically. That operational simplicity is why it's the default recommendation for most enterprise fabrics. The multicast example above maps just one VNI to one group; now imagine hundreds, if not thousands, of VNIs that each need a mapping on every VTEP. Without automation this can become a nightmare. In upcoming posts I'll cover Software-Defined Networking (SDN) controllers, which abstract this mapping away entirely.
When to Use Each
The inflection point is tenant multicast. The moment a workload needs L3 multicast across the fabric, IR is off the table and you're configuring PIM whether you want to or not. Fabric size is the other factor. In a large fabric, watch the replication load on your ingress VTEPs, which are most likely your leaf switches: with IR, each of them copies every BUM frame once per remote VTEP, so hundreds or thousands of flooded frames turn into many times that number of unicast packets.
Everything else being equal, IR is the right default for most enterprise deployments. But understanding the multicast underlay and TRM means you know exactly why it's the default and exactly when to reach for the alternative.
Working through this on the CCNP DC path. The full VXLAN Multi-Site topology lives in CML if you want to see how this maps to actual config.