asymmetric routing around the firewall

This is a follow-up to my previous post about some mysterious TCP connection timeouts in the UC Berkeley wired network.

I received many thoughtful emails in response to that post, and there was an excellent discussion on Hacker News. I’m very grateful to everyone who spent time helping with my ridiculous home networking issues.

Since then, I’ve learned some new information that (mostly) solves the mystery. In this post, I’ll first summarize the main theories people suggested, then tell the story of how a Bunny CDN engineer came to the rescue, and finally describe the root cause. If you just want to know the answer, please skip to the end.

Theories

Theory #1: MTU

Many people suggested that packets might be larger than the maximum transmission unit (MTU) somewhere along the path to Bunny CDN. However, I don’t find this theory plausible, mainly because the packet being dropped is the client’s bare ACK, which carries no payload and is far smaller than any plausible MTU.
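
For what it’s worth, a quick way to sanity-check the path MTU is to send full-size packets with the don’t-fragment bit set. A rough sketch, assuming a standard 1500-byte Ethernet MTU and that the server answers ICMP echo:

# 1472 bytes of ICMP payload + 28 bytes of ICMP/IP headers = 1500 bytes on the wire.
# -M do sets the don't-fragment bit, so an MTU bottleneck along the path would show up
# as "message too long" / "frag needed" errors instead of normal replies.
ping -c 3 -M do -s 1472 169.150.221.147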

Theory #2: Firewall rejects SNI

Some people suggested that a firewall might be blocking connections based on the server name indication (SNI) in the TLS Client Hello, which the client sends immediately after its ACK.

I mentioned briefly in the post that I was seeing the same behavior with plain HTTP, not just HTTPS, but didn’t provide a packet capture. To clarify: when I say the issue happens with HTTP, I mean I see the exact same symptoms, including the dropped client ACK. There is no HTTP redirect to the HTTPS site, because the TCP connection is never successfully established.
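
For reference, an HTTP-only test looks something like this (the exact command isn’t in the original post, so treat it as representative); with no TLS handshake, no SNI is ever sent:

curl -v http://fonts.bunny.net --resolve fonts.bunny.net:80:169.150.221.147 --connect-timeout 10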

Additionally, I discovered later that other Bunny CDN IPs worked correctly. For example, Bunny CDN uses IP 107.182.163.162 located in Utah, and I’m able to connect with

curl https://fonts.bunny.net --resolve fonts.bunny.net:443:107.182.163.162

even though this command sends the exact same SNI (fonts.bunny.net).

Theory #3: Firewall blocks 169.0.0.0/8 instead of 169.254.0.0/16

Several people noticed that the destination IP address 169.150.221.147 looks very similar to the link-local address range 169.254.0.0/16 (see section 2.1 in RFC 3927). The theory is that someone accidentally configured a firewall rule to block 169.0.0.0/8 instead of 169.254.0.0/16, inadvertently blocking the Bunny CDN IP address 169.150.221.147.
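
For illustration, the hypothesized mistake is the difference between two rules like these (purely hypothetical, written as Linux iptables rules just to show the idea):

# Intended: drop traffic to the link-local range only
iptables -A FORWARD -d 169.254.0.0/16 -j DROP
# Hypothesized typo: drops all of 169.0.0.0/8, which includes 169.150.221.147
iptables -A FORWARD -d 169.0.0.0/8 -j DROP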

I didn’t mention it in the original post, but I see the same behavior with another Bunny CDN IP, 143.244.50.88, which isn’t in the 169.0.0.0/8 range. So unfortunately this theory doesn’t quite fit either.

Theory #4: Misconfigured firewall drops outbound established connections

One person reproduced the same symptoms in their home network by adding a firewall rule to drop packets to 169.150.221.147. The trick was to configure the rule only in the outbound direction and only for established connections. This causes the client ACK to be dropped, and the packet captures show exactly the same pattern of retransmitted ACKs (from the client) and SYN+ACKs (from the server).
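
I don’t know their exact rule, but on Linux the reproduction would look something like this (don’t leave it in place; it deliberately breaks connectivity to that IP):

# Drop outbound packets to 169.150.221.147, but only once conntrack considers the
# connection ESTABLISHED (i.e. after the SYN+ACK has been seen). The outgoing SYN
# still gets through, but the final ACK of the handshake is silently dropped.
sudo iptables -A OUTPUT -p tcp -d 169.150.221.147 -m conntrack --ctstate ESTABLISHED -j DROP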

This theory aligns really well with the symptoms I’m observing. However, it seemed unlikely that someone would deliberately configure a rule to block a connection only after it’s established, and Berkeley IT told me there were no firewall rules blocking destinations 169.150.221.147 or 143.244.50.88.

Theory #5: Asymmetric routing

This was the guess I made in the original post, and a few people agreed with me. Someone suggested checking the IP header TTL field or using traceroute with different ports/probes. I learned about some traceroute flags I’d never used before; for example

sudo traceroute 169.150.221.147 -p 443 -q 1 -T -O syn

sends a TCP probe on port 443 with the SYN flag set. Unfortunately, I didn’t find any evidence in the IP TTL field or traceroute output to prove that the timeouts were caused by asymmetric routing.
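
For reference, the TTL check amounts to capturing the incoming SYN+ACKs and reading the ttl value that tcpdump prints in verbose mode; something along these lines (interface selection left to tcpdump’s default):

# -v prints IP header fields, including ttl; comparing the ttl on the SYN+ACK against
# a typical initial value (64, 128, or 255) hints at how many hops the return path traversed.
sudo tcpdump -n -v src host 169.150.221.147 and tcp port 443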

As you can probably guess from the title of this post, I was eventually able to confirm this theory, but not without some help from an engineer at Bunny CDN.

Help from a Bunny CDN engineer

In response to my post, an infrastructure engineer from Bunny CDN offered to investigate. I can’t express enough how grateful I am that someone at Bunny CDN spent time helping to solve this mystery!

The Bunny CDN engineer suggested testing the following IPs to narrow down the issue:

IP Address        Location      Hosting Provider
169.150.221.147   San Jose      DataPacket
143.244.50.88     Los Angeles   DataPacket
107.182.163.162   Utah          WebNX

The two IPs hosted by DataPacket timed out, but I could connect successfully to the IP hosted by WebNX.1
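
My test for each IP was the same curl --resolve trick as before; a quick loop like the following (a sketch, not my exact commands) covers all three:

for ip in 169.150.221.147 143.244.50.88 107.182.163.162; do
  # --resolve forces fonts.bunny.net to resolve to $ip; --max-time bounds the hang
  curl -s -o /dev/null --max-time 10 --resolve fonts.bunny.net:443:$ip https://fonts.bunny.net \
    && echo "$ip OK" || echo "$ip failed"
done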

The engineer also offered to take a packet capture from a Bunny CDN server. I ran this script on a machine in my apartment:

while true; do
  curl -vvv -4 https://fonts.bunny.net \
    --connect-timeout 10 \
    --no-progress-meter -D - -o /dev/null;
  sleep 10;
done

and he took the packet capture on the Bunny CDN server, filtering for my public IP address. Comparing his capture with one taken on my end, the pattern was clear: my machine sent the final ACK of the handshake (and kept retransmitting it), but that ACK never arrived at the Bunny CDN server, which kept retransmitting its SYN+ACK.

This confirmed that the client ACK was being dropped somewhere before reaching Bunny CDN, almost certainly within the UC Berkeley network.
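
(For the curious: taking such a capture is conceptually just a tcpdump filter on my public address, something like the line below, with the placeholder filled in. I don’t know the exact command he used.)

# Capture every packet to or from the client's public IP and save it for analysis.
sudo tcpdump -n -i any host <my-public-ip> -w client.pcap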

Finally, the Bunny CDN engineer used mtr to trace the path from 169.150.221.147 back to my IP address. This turned out to be the critical clue, as described in the next section.
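
The traces in the next section are mtr’s report mode; the exact invocation doesn’t matter much, but it looks something like:

mtr --report --report-wide -c 3 169.150.221.147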

(Mostly) solving the mystery

Tracing the path in the outbound direction from my IP to Bunny CDN, I noticed what looked like a firewall, “reshall-fw--ethernet1-21-682.sait-west.berkeley.edu”:

HOST: fedora                                                  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- _gateway                                                 0.0%     3    1.5   2.8   1.5   5.5   2.3
  2.|-- ucv-cdf-r1--irb-525.net.berkeley.edu                     0.0%     3    2.4  26.0   2.4  60.8  30.8
  3.|-- sut-mdc-cr1--xe-1-1-11.net.berkeley.edu                  0.0%     3   20.2  12.8   5.6  20.2   7.3
  4.|-- sut-mdc-sr7--irb-204.net.berkeley.edu                    0.0%     3    4.4   7.5   4.4  12.5   4.3
  5.|-- reshall-fw--ethernet1-21-682.sait-west.berkeley.edu      0.0%     3    6.8   5.1   3.1   6.8   1.9
  6.|-- sut-mdc-sr7--irb-199.net.berkeley.edu                    0.0%     3    3.6   3.6   3.4   3.7   0.1
  7.|-- reccev-cev-cr1--et-0-0-3.net.berkeley.edu                0.0%     3    5.0   5.2   4.1   6.4   1.2
  8.|-- reccev-cev-br1--et-1-1-1.net.berkeley.edu                0.0%     3   45.5  24.6   3.7  45.5  20.9
  9.|-- emvl1-agg-01--ucb--100g.cenic.net                        0.0%     3    4.2   4.7   3.0   7.1   2.1
 10.|-- sacr2-agg-01--emvl1-agg-01--400g--01.cenic.net           0.0%     3    7.9   7.4   5.4   8.8   1.7
 11.|-- hundredge-0-0-0-24.98.core2.sacr.net.internet2.edu       0.0%     3   11.5  13.9  10.9  19.3   4.7
 12.|-- fourhundredge-0-0-0-0.4079.core2.sunn.net.internet2.edu  0.0%     3    9.0   8.8   8.3   9.0   0.4
 13.|-- fourhundredge-0-0-0-49.4079.agg2.sanj.net.internet2.edu  0.0%     3   11.3  11.2  10.8  11.4   0.3
 14.|-- 162.252.69.142                                           0.0%     3    7.5   7.6   7.5   7.7   0.1
 15.|-- be-2111-cs01.9greatoaks.ca.ibone.comcast.net             0.0%     3   10.5   9.0   8.1  10.5   1.3
 16.|-- be-2113-pe13.9greatoaks.ca.ibone.comcast.net             0.0%     3    7.1   8.6   7.1   9.9   1.4
 17.|-- 71.25.198.98                                             0.0%     3    7.0   7.3   7.0   7.6   0.3
 18.|-- vl201.sjc-eq10-dist-1.cdn77.com                          0.0%     3    7.6   8.8   7.6  10.2   1.3
 19.|-- 169-150-221-147.bunnyinfra.net                           0.0%     3    7.9   7.6   7.1   7.9   0.4

When the Bunny CDN engineer traced the path in the reverse direction (Bunny CDN back to my IP), he saw this:

HOST: edge-915.bunnyinfra.net                      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- unn-169-150-221-156.datapacket.com            0.0%     3    0.5   0.5   0.5   0.5   0.0
  2.|-- vl202.sjc-eq10-core-2.cdn77.com               0.0%     3    0.4   0.5   0.4   0.5   0.0
  3.|-- vl250.sjc-eq10-core-1.cdn77.com               0.0%     3    0.6   0.6   0.6   0.6   0.0
  4.|-- be-111-pe13.9greatoaks.ca.ibone.comcast.net   0.0%     3    0.8   1.0   0.8   1.1   0.1
  5.|-- be-2113-cs01.9greatoaks.ca.ibone.comcast.net  0.0%     3    0.9   1.2   0.9   1.4   0.2
  6.|-- be-36311-ar01.hayward.ca.sfba.comcast.net     0.0%     3    2.3   2.3   2.3   2.3   0.0
  7.|-- be-398-rar01.pleasanton.ca.sfba.comcast.net   0.0%     3    2.7   2.8   2.7   3.0   0.1
  8.|-- be-12-sur04.pinole.ca.sfba.comcast.net        0.0%     3    3.6   3.6   3.6   3.6   0.0
  9.|-- ???                                          100.0     3    0.0   0.0   0.0   0.0   0.0
 10.|-- sut-mdc-cr1--xe-1-0-5.net.berkeley.edu        0.0%     3    6.3   5.6   5.1   6.3   0.6
 11.|-- ucv-cdf-r1--irb-525.net.berkeley.edu          0.0%     3   12.1  10.0   6.0  12.1   3.4
 12.|-- ???                                          100.0     3    0.0   0.0   0.0   0.0   0.0

Comparing the outbound and inbound paths, I noticed something strange: inbound traffic from 169.150.221.147 was bypassing the firewall!

[Diagram: outbound traffic traverses the firewall, but inbound traffic bypasses it]

This perfectly explains the dropped client ACK and TCP connection timeouts:

  1. Outbound client SYN goes through the firewall.
  2. Inbound server SYN+ACK is routed back to the client without going through the firewall.
  3. Client sends ACK to complete the TCP handshake.
  4. Since the firewall never saw the server SYN+ACK, it drops the client ACK.
  5. Since the server never receives the (dropped) client ACK, the TCP handshake never completes.
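
To make step 4 concrete, here is a minimal sketch of the kind of stateful policy that behaves this way, written as Linux iptables rules purely for illustration (I have no visibility into what the actual reshall-fw runs or how it’s configured):

# Accept packets belonging to connections the firewall has tracked in both directions,
# accept new outbound connections, and drop everything else. Because the SYN+ACK
# bypassed the firewall, the connection never reaches ESTABLISHED in its state table,
# so the client's ACK is treated as out-of-state and hits the final DROP.
iptables -A FORWARD -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A FORWARD -i eth1 -m conntrack --ctstate NEW -j ACCEPT   # eth1 = campus-facing interface (hypothetical)
iptables -A FORWARD -j DROP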

I don’t know which router the “???” represents, but I’m fairly certain inbound traffic isn’t supposed to skip the firewall.2 This is most likely caused by a route table misconfigured somewhere in the UC Berkeley network. Unfortunately, I cannot investigate further without the cooperation of Berkeley IT.

Conclusion

Will Berkeley IT fix the routing misconfiguration? I’ve done everything in my power to escalate this to the appropriate team, including directly emailing the Executive Director of Campus IT.3 Since that email, Berkeley IT has at least stopped trying to close the ticket (good!) but assigned it low priority because I’m the only person who has reported the problem.4 Later, when I realized that inbound traffic was bypassing the firewall, I notified UC Berkeley’s Information Security Office of the potential security vulnerability, but their response was somewhat lacking in urgency.5 So we’ll see.

I want to again thank everyone who took the time to respond to my last post. Even though the issue hasn’t been fixed, I feel better knowing why it’s happening, and I learned a lot from all the feedback. I especially want to thank the infrastructure engineer at Bunny CDN for providing the crucial clue that solved the mystery!

Finally, a few people suggested using a VPN as a workaround, which makes a lot of sense. However, I realized there’s one workaround no one suggested, and it happens to be the one I’m actually pursuing (albeit for unrelated reasons): moving to a new home where I can manage the network myself.

Update (2024-04-12): Today, Berkeley IT told me, “It appears that the Xfinity link present only on the Reshall networks has introduced a condition that has contributed to the asymmetrical routing situation,” and they’re working on fixing it. So now I can say that the mystery is completely solved!


  1. I later discovered two other IPs from DataPacket, not used by Bunny CDN, that showed the exact same symptoms: www.datapacket.com at IP 185.152.67.7 and assets.gentoo.org at IP 156.146.53.32. ↩︎

  2. The Bunny CDN engineer also shared the mtr output from IP 107.182.163.162 (hosted by WebNX) to my IP. That path correctly included the firewall, which explains why I was able to connect successfully. ↩︎

  3. I went up the org chart until I found someone who had both a technical background and publicly listed email address. To his credit, the Executive Director responded within a day and forwarded the email to the network services team. ↩︎

  4. Berkeley IT at least acknowledged in their response that other people may have seen this issue without reporting it. Anecdotally, when a page fails to load, people tend to blame the website rather than the network. For example, after my wife saw repeated timeouts on ravelry.com, she believed for months that the company had simply gone out of business. ↩︎

  5. Full text of the response from security@berkeley.edu: “Good afternoon. Thank you for the alert. Please keep us posted on the Network Services investigation. If this is determined to be a Security issue, let me ask you to send us a reply. Thank you.” ↩︎