Chasing an Intermittent MTU Black Hole Across a VXLAN Overlay

Some outages announce themselves. This one whispered. Large file copies between two data-centre racks would stall at exactly the same point. SSH sessions were fine, ping was fine, and a curl of a small API endpoint was fine, but pull a container image or run a database backup across the fabric and the transfer would hang and eventually time out. Classic MTU black hole behaviour, made harder to see because it lived underneath a VXLAN overlay.

This is the write-up I wish I’d had that afternoon: the reasoning, the three commands that mattered, and the single packet capture that ended the argument.

The symptom, stated precisely

The important detail with an MTU black hole is that connectivity works and throughput doesn’t. Small packets sail through. Large packets, specifically anything that exceeds the smallest MTU on the path and cannot be fragmented there (typically because the Don’t Fragment bit is set), vanish silently. No ICMP error comes back, so Path MTU Discovery (PMTUD) never learns to back off, and the sender keeps retransmitting full-size segments into a hole.

Under a VXLAN overlay this gets sneaky. VXLAN wraps the original Ethernet frame in an outer UDP/IP header, adding 50 bytes of encapsulation (54 with a dot1q tag):

Outer Ethernet (14) + Outer IP (20) + UDP (8) + VXLAN (8) = 50 bytes

So a tenant sending a standard 1500-byte frame produces a 1550-byte packet on the underlay. If the underlay isn’t configured to carry that, the overlay looks healthy right up until someone sends a full-size packet.
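
The arithmetic above is small enough to write down as code. This is a minimal sketch using only the header sizes quoted in the breakdown; the function and constant names are made up for illustration.

```python
# Header sizes in bytes, taken from the breakdown above.
VXLAN_HEADERS = {
    "outer_ethernet": 14,
    "outer_ipv4": 20,
    "udp": 8,
    "vxlan": 8,
}
DOT1Q_TAG = 4  # optional 802.1Q tag on the outer frame


def vxlan_overhead(dot1q=False):
    """Return the bytes that VXLAN encapsulation adds to a tenant frame."""
    overhead = sum(VXLAN_HEADERS.values())
    return overhead + DOT1Q_TAG if dot1q else overhead


def underlay_size(tenant_size, dot1q=False):
    """Return the size of the encapsulated packet on the underlay."""
    return tenant_size + vxlan_overhead(dot1q)
```

Calling underlay_size(1500) gives 1550, and vxlan_overhead(dot1q=True) gives the 54-byte tagged figure.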

Reproducing it on demand

The first job in any intermittent problem is to make it non-intermittent. I don’t trust a fix I can’t first trigger reliably. From a host in the tenant network I forced large, unfragmentable packets at the far end:

# Linux: DF bit set; 1472-byte payload + 8 ICMP + 20 IP = a full 1500-byte packet
ping -M do -s 1472 10.20.10.5

PING 10.20.10.5 (10.20.10.5) 1472(1500) bytes of data.
--- 10.20.10.5 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss

Then I walked the size down:

ping -M do -s 1422 10.20.10.5   # 1450-byte frame -> success
ping -M do -s 1472 10.20.10.5   # 1500-byte frame -> 100% loss

That’s the whole diagnosis in two lines. Anything at or below a 1450-byte tenant frame worked; a normal 1500-byte frame did not. The overlay was eating exactly the encapsulation overhead. The remaining question was where.
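
I walked the size down by hand, but the same search can be automated. The sketch below is a binary search for the largest DF-set payload that still gets through. It assumes a hypothetical probe(size) callable that returns True when a ping of that payload size succeeds, and it assumes the behaviour is monotonic: every size at or below the threshold passes and every size above it fails.

```python
IP_ICMP_HEADERS = 28  # 20-byte IPv4 header + 8-byte ICMP header


def largest_passing_payload(probe, low=0, high=1472):
    """Binary-search the largest payload size for which probe(size) is True.

    probe is a hypothetical callable supplied by the caller.
    Returns -1 if even the smallest size fails.
    """
    if not probe(low):
        return -1
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid       # mid passes, so the answer is mid or larger
        else:
            high = mid - 1  # mid fails, so the answer is below mid
    return low


def packet_size(payload):
    """Convert an ICMP payload size to the on-the-wire IP packet size."""
    return payload + IP_ICMP_HEADERS
```

With a probe that mimics this incident (anything up to a 1450-byte packet passes), the search lands on a 1422-byte payload, matching the manual result.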

Where the bytes go: budgeting the MTU

Before touching a device I wrote the byte budget down, because guessing is how you waste an afternoon:

Tenant frame the host wants to send:      1500
VXLAN encapsulation overhead:             + 50
Required underlay (jumbo) MTU:            1550  (minimum)

The rule I hold onto: the underlay MTU must be at least the tenant MTU plus 50. In practice you don’t set it to 1550 and call it done — you set the underlay to a proper jumbo value (9000, or 9216 on many switches) so the overlay has headroom and you never have this conversation again. The tenant/overlay MTU then sits comfortably underneath.
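
The rule can be turned around to ask what a given underlay path will actually carry. This is a rough sketch, assuming the untagged 50-byte overhead and a path described as a plain list of link MTUs; the three-link path in the usage note is illustrative, not the real topology.

```python
from typing import Sequence

VXLAN_OVERHEAD = 50  # bytes, untagged outer frame


def max_tenant_mtu(underlay_link_mtus: Sequence[int], overhead: int = VXLAN_OVERHEAD) -> int:
    """Largest tenant MTU the path can carry: the smallest link MTU minus overhead."""
    if not underlay_link_mtus:
        raise ValueError("need at least one underlay link MTU")
    return min(underlay_link_mtus) - overhead


def path_supports(tenant_mtu: int, underlay_link_mtus: Sequence[int]) -> bool:
    """True if every link on the path can carry the encapsulated tenant packet."""
    return max_tenant_mtu(underlay_link_mtus) >= tenant_mtu
```

For a path of [9216, 1500, 9216], max_tenant_mtu returns 1450: a single 1500-byte link caps the tenant at exactly the size where the forced pings stopped working.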

The three commands that mattered

I traced the underlay path hop by hop. On a Nexus fabric the interesting state lives in three places.

First, the physical interface MTU on the underlay links:

switch# show interface Ethernet1/1 | include MTU
  MTU 1500 bytes, BW 100000000 Kbit

There it was on one leaf: MTU 1500. Every other underlay link in the path reported MTU 9216. A single interface had been re-provisioned during a line-card swap and never had the jumbo MTU reapplied.

Second, confirm what the system-level network-qos policy actually programmed, because on some Nexus platforms the effective MTU, particularly on Layer 2 ports, can be driven by the network-qos policy rather than the interface line alone:

switch# show queuing interface Ethernet1/1 | include -i mtu
    MTU: 1500

Third — and this is the one people skip — confirm the VTEP source interface and its MTU, since that’s the interface actually building the encapsulated packets:

switch# show nve interface nve1 detail | include -i "MTU|Source"
  Source-Interface: loopback1 (primary: 10.0.0.11)
  ...

The loopback itself has no MTU concern, but the uplinks carrying its traffic do — and one of them was the 1500-byte offender.

The packet capture that ended the argument

Interface counters are suggestive; a capture is proof. Rather than SPAN the whole thing, I set an ACL-based capture on the suspect uplink for the underlay VTEP-to-VTEP traffic and looked at the outer header. On the sending VTEP the encapsulated frames were leaving at 1550 bytes. On the offending hop, the oversized frames were being dropped, and, the key point, no ICMP “fragmentation needed” was generated back toward the source. Even if one had been, it would have been addressed to the outer source, the sending VTEP, rather than to the tenant host whose packet was too big, so the tenant’s PMTUD had nothing to work with.

Reading the capture, the story was unambiguous:

No.  Source        Destination   Proto  Length  Info
1    10.0.0.11     10.0.0.12     UDP    1550    4789 -> ... (VXLAN)   [seen ingress]
2    (none egress on Eth1/1)                                          [dropped]

Frame in, no frame out, no ICMP back. That is the textbook signature of a black hole rather than ordinary congestion or a routing loop. Congestion drops some frames; a black hole drops a class of frames deterministically.

The fix, and why it was boring

The fix was a single interface:

switch# configure terminal
switch(config)# interface Ethernet1/1
switch(config-if)# mtu 9216
switch(config-if)# end
switch# copy running-config startup-config

Re-running the forced ping immediately succeeded at 1500-byte frames, and the stalled transfers completed. Boring fixes are the goal. The value wasn’t the command — it was getting to the command quickly instead of rebooting things and hoping.

What I changed afterward

An outage you fix but don’t systematise is an outage you’ll see again. Three things went into the runbook:

The first was a standing verification that every underlay interface carries the jumbo MTU, expressed as something a script can check rather than something a human remembers. A quick loop over show interface ... | include MTU across the fabric flags any interface that isn’t at the fabric standard.
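
One way that check could look, as a sketch rather than the script we actually run. It assumes the output of the show interface command has already been collected per interface into a dict (how it gets collected is out of scope here), and that the MTU line has the same shape as the one shown earlier. The dict keys and function names are hypothetical.

```python
import re

FABRIC_MTU = 9216  # the fabric standard in this incident

# Matches lines such as "  MTU 1500 bytes, BW 100000000 Kbit"
MTU_LINE = re.compile(r"\bMTU\s+(\d+)\s+bytes")


def parse_mtu(show_output):
    """Extract the MTU from collected show-interface output, or None if absent."""
    match = MTU_LINE.search(show_output)
    return int(match.group(1)) if match else None


def non_standard_interfaces(outputs, expected=FABRIC_MTU):
    """Return (name, mtu) pairs whose MTU is not the expected value.

    An mtu of None means the output could not be parsed, which is
    flagged as well rather than silently passed.
    """
    flagged = []
    for name, text in outputs.items():
        mtu = parse_mtu(text)
        if mtu != expected:
            flagged.append((name, mtu))
    return sorted(flagged, key=lambda item: item[0])
```

Fed one interface reporting MTU 1500 and the rest reporting 9216, it returns only the 1500-byte interface.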

The second was a synthetic large-packet probe between representative hosts in each tenant, run on a schedule, alerting if a DF-set full-size packet stops getting through. MTU black holes are invisible to normal reachability monitoring precisely because small packets work — so the monitoring has to send big ones on purpose.
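
A minimal sketch of such a probe, assuming a Linux host with iputils ping, which exits 0 when at least one reply comes back. The count and timeout values are arbitrary choices, and scheduling and alerting are left to whatever monitoring system is already in place.

```python
import subprocess

IP_ICMP_HEADERS = 28  # 20-byte IPv4 header + 8-byte ICMP header


def probe_command(target, packet_size=1500, count=3):
    """Build a Linux ping command that sends DF-set packets of packet_size bytes."""
    payload = packet_size - IP_ICMP_HEADERS
    if payload < 0:
        raise ValueError("packet_size is smaller than the IP and ICMP headers")
    # -M do sets Don't Fragment, -W 2 waits up to 2 seconds per reply
    return ["ping", "-M", "do", "-c", str(count), "-W", "2",
            "-s", str(payload), target]


def large_packet_ok(target, packet_size=1500):
    """Return True if at least one DF-set packet of packet_size bytes got a reply."""
    result = subprocess.run(probe_command(target, packet_size),
                            capture_output=True, text=True)
    return result.returncode == 0
```

A scheduled job would call large_packet_ok for each representative host pair and raise an alert on False. Pairing it with a small-packet ping to the same target helps tell an MTU black hole apart from a host that is simply down.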

The third was a note in the change template for any line-card or interface work: reapply the fabric MTU and prove it with a large-packet test before closing the change. The original break came from a hardware swap, and the cheapest place to catch it is the change itself.

Takeaways

If throughput dies but reachability lives, suspect MTU before you suspect anything glamorous. Under an overlay, budget the encapsulation overhead explicitly — VXLAN costs you 50 bytes, so the underlay has to be jumbo. Reproduce with ping -M do -s <size> so you’re working with a fact instead of a feeling. And when interface counters and opinions disagree, capture the traffic and read the outer header: frame in, no frame out, no ICMP back is the fingerprint of a black hole.

The whole incident came down to one interface at 1500 in a fabric built for 9216. The hard part was never the fix. It was trusting the byte budget enough to go find it.
