slow iptables, reverse dns

2024-04-20 #linux #networking

Last week, I investigated an issue¹ where iptables -L (the command to list firewall rules in Linux) took over 60 seconds to complete. This was surprising because the machine was using very little CPU and there were only a few firewall rules. So why was it so slow?

The Linux kernel stores firewall rules in memory. From userspace, iptables retrieves the rules from the kernel, using netlink in iptables-nft and getsockopt in iptables-legacy. After retrieving the rules, iptables simply writes them to stdout. None of this requires disk or external network access… or so I thought.

One of my coworkers found some posts on the Internet saying that iptables -L uses reverse DNS to resolve IP addresses in the firewall rules to hostnames. A lot of these posts were very old, so we weren’t sure at first if this was still true in 2024.² But then I checked the iptables code and found a call to getnameinfo, which is invoked by both iptables-legacy and iptables-nft.³

To confirm this theory, we tried passing the -n (numeric) flag to disable reverse DNS lookup, like this:

iptables -nL

and sure enough with -n the command completed in a fraction of a second.

In a correctly configured system, reverse DNS should be fast. Even if the DNS server fails to resolve the IP to a hostname, it should at least respond with NXDOMAIN or SERVFAIL so the client doesn’t wait on a response. However, if the DNS server never responds, or responds slowly, then the client will wait, retry, and eventually time out.

It’s easy to reproduce the slowdown using Linux network namespaces. Create two network namespaces, one for the DNS client and one for the DNS server, connected by a virtual Ethernet (veth) pair. To simulate an unresponsive DNS server, configure the firewall in the server namespace to drop all inbound traffic.⁴

[Diagram: client and server network namespaces connected by a veth pair. The client has IP address 192.168.99.1 and the server has IP address 192.168.99.2. The client has /etc/resolv.conf configured with nameserver 192.168.99.2.]

# Override resolv.conf inside the client netns to control the client's nameserver.
# Also override /etc/nsswitch.conf to prevent the DNS client from using systemd-resolved.
mkdir -p /etc/netns/client
echo "nameserver 192.168.99.2" > /etc/netns/client/resolv.conf
echo "hosts: dns" > /etc/netns/client/nsswitch.conf

# Create network namespaces for the DNS client and server.
ip netns add "client"
ip netns add "server"

# Create veth pair linking DNS client and server netns.
# Assign the client IP 192.168.99.1 and the server IP 192.168.99.2.
ip -n "client" link add dev "eth0" type veth peer name "eth0" netns "server"
ip -n "client" addr add dev "eth0" "192.168.99.1/24"
ip -n "client" link set dev "eth0" up
ip -n "server" addr add dev "eth0" "192.168.99.2/24"
ip -n "server" link set dev "eth0" up

# Configure the firewall on the server to drop everything in the INPUT chain.
# The server won't send any reply to the client, not even an ICMP "port unreachable".
ip netns exec "server" iptables -P INPUT DROP

# Add some firewall rules on the client.
# We could use any IP address in the rules, so arbitrarily choose IPs from 10.0.0.0/8.
ip netns exec "client" iptables -N TEST
for i in {1..3}; do
    ip netns exec "client" iptables -A TEST -p tcp -d 10.0.0.$i -j ACCEPT
done
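
When you're finished experimenting (after the timing and capture steps that follow), the setup can be undone. Here is a minimal teardown sketch, assuming the namespace names and the /etc/netns/client path from the script above. The run helper and the DRY_RUN variable are hypothetical additions for illustration, not part of iproute2 or iptables. By default the commands are only printed; set DRY_RUN=0 and run as root to execute them.

```shell
# Print the commands by default; set DRY_RUN=0 (as root) to actually run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

teardown() {
    # Deleting a namespace destroys the veth end inside it (and so its peer),
    # along with the iptables rules that lived in that namespace.
    run ip netns del client
    run ip netns del server
    # Remove the per-namespace resolv.conf and nsswitch.conf overrides.
    run rm -r /etc/netns/client
}

teardown
```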

Then time the iptables list command:

[root@fedora]# time ip netns exec client iptables -L TEST
Chain TEST (0 references)
target     prot opt source               destination
ACCEPT     tcp  --  anywhere             10.0.0.1
ACCEPT     tcp  --  anywhere             10.0.0.2
ACCEPT     tcp  --  anywhere             10.0.0.3

real    1m0.069s
user    0m0.001s
sys     0m0.004s

Here’s the packet capture from the client side:

[root@fedora]# ip netns exec client tcpdump -nv udp port 53
dropped privs to tcpdump
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
09:59:22.139083 IP (tos 0x0, ttl 64, id 8849, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.99.1.45598 > 192.168.99.2.domain: 4498+ PTR? 1.0.0.10.in-addr.arpa. (39)
09:59:27.144524 IP (tos 0x0, ttl 64, id 8850, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.99.1.45598 > 192.168.99.2.domain: 4498+ PTR? 1.0.0.10.in-addr.arpa. (39)
09:59:32.150005 IP (tos 0x0, ttl 64, id 18829, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.99.1.39721 > 192.168.99.2.domain: 22173+ PTR? 1.0.0.10.in-addr.arpa. (39)
09:59:37.155323 IP (tos 0x0, ttl 64, id 18830, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.99.1.39721 > 192.168.99.2.domain: 22173+ PTR? 1.0.0.10.in-addr.arpa. (39)
09:59:42.160757 IP (tos 0x0, ttl 64, id 42006, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.99.1.42513 > 192.168.99.2.domain: 18542+ PTR? 2.0.0.10.in-addr.arpa. (39)
09:59:47.166116 IP (tos 0x0, ttl 64, id 42007, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.99.1.42513 > 192.168.99.2.domain: 18542+ PTR? 2.0.0.10.in-addr.arpa. (39)
09:59:52.171543 IP (tos 0x0, ttl 64, id 46047, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.99.1.43565 > 192.168.99.2.domain: 15599+ PTR? 2.0.0.10.in-addr.arpa. (39)
09:59:57.176895 IP (tos 0x0, ttl 64, id 46048, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.99.1.43565 > 192.168.99.2.domain: 15599+ PTR? 2.0.0.10.in-addr.arpa. (39)
10:00:02.182315 IP (tos 0x0, ttl 64, id 12270, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.99.1.45392 > 192.168.99.2.domain: 29590+ PTR? 3.0.0.10.in-addr.arpa. (39)
10:00:07.187633 IP (tos 0x0, ttl 64, id 12271, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.99.1.45392 > 192.168.99.2.domain: 29590+ PTR? 3.0.0.10.in-addr.arpa. (39)
10:00:12.193058 IP (tos 0x0, ttl 64, id 22543, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.99.1.35956 > 192.168.99.2.domain: 53102+ PTR? 3.0.0.10.in-addr.arpa. (39)
10:00:17.198478 IP (tos 0x0, ttl 64, id 22544, offset 0, flags [DF], proto UDP (17), length 67)
    192.168.99.1.35956 > 192.168.99.2.domain: 53102+ PTR? 3.0.0.10.in-addr.arpa. (39)
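
The query names in the capture follow the standard reverse DNS convention for IPv4: the four octets of the address are reversed and in-addr.arpa is appended, so 10.0.0.1 becomes 1.0.0.10.in-addr.arpa. A small sketch of that mapping follows. The name ptr_name is a hypothetical helper, and it only handles dotted-quad IPv4 input.

```shell
# Hypothetical helper: print the PTR query name for a dotted-quad IPv4 address.
# The function body runs in a subshell so its variables don't leak.
ptr_name() (
    # Split the address on dots into its four octets.
    IFS=. read -r a b c d <<EOF
$1
EOF
    # Reverse the octets and append the in-addr.arpa suffix.
    printf '%s.%s.%s.%s.in-addr.arpa.\n' "$d" "$c" "$b" "$a"
)

# The three destination addresses from the TEST chain.
for ip in 10.0.0.1 10.0.0.2 10.0.0.3; do
    ptr_name "$ip"
done
```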

The client sends PTR queries to resolve each IP to a hostname, waiting in vain for responses that never arrive. Notice that the client queries each IP address four times (initial query plus three retries), with a five-second timeout between retries:

3 IPs x 4 queries/IP x 5 seconds/query = 60 seconds
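
The same arithmetic gives a quick worst-case estimate for other rule sets. This is a rough sketch: worst_case_seconds is a hypothetical helper, and its defaults of four queries per address and five seconds per query are simply the values observed in the capture above. It assumes every address is looked up separately against a resolver that never answers. Wildcard addresses shown as "anywhere" don't count, since the capture shows no lookups for them.

```shell
# Hypothetical helper: worst-case delay when no reverse DNS query is answered.
#   $1 = number of addresses to resolve
#   $2 = queries sent per address (default 4, as observed in the capture)
#   $3 = seconds waited per query (default 5, as observed in the capture)
worst_case_seconds() (
    addresses=$1
    queries_per_address=${2:-4}
    seconds_per_query=${3:-5}
    echo $(( addresses * queries_per_address * seconds_per_query ))
)

worst_case_seconds 3     # the three rules in the TEST chain: prints 60
worst_case_seconds 50    # fifty addresses: prints 1000
```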

iptables isn’t the only Linux networking tool that uses reverse DNS by default: tcpdump and traceroute do this too. It’s a curious decision, because these are often the very tools used to debug DNS problems.⁵
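
For tcpdump and traceroute the remedy is the same flag: both accept -n to skip name resolution. If you tend to forget it, one option is a set of thin shell wrappers. The wrapper names below are hypothetical, and this is a convenience sketch, not something any of these projects recommend.

```shell
# Hypothetical wrappers that always pass -n (numeric output, no reverse DNS).
iptables_n()   { iptables -n "$@"; }
tcpdump_n()    { tcpdump -n "$@"; }
traceroute_n() { traceroute -n "$@"; }

# Example usage (needs root and the respective tools installed):
#   iptables_n -L TEST
#   tcpdump_n udp port 53
#   traceroute_n 192.168.99.2
```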

In conclusion, if iptables -L is slow, try iptables -nL instead to skip reverse DNS lookups!

¹ This eventually led to a one-character bugfix in Azure CNI.
² I later discovered that this behavior is documented in the man page for iptables under the section about -L: “Please note that it is often used with the -n option, in order to avoid long reverse DNS lookups.”
³ The call path in iptables-nft v1.8.9 is nft_ipv4_print_rule → print_ipv4_addresses → ipv4_addr_to_string → xtables_ipaddr_to_anyname → ipaddr_to_host → getnameinfo.
⁴ Without the inbound DROP rule, the Linux kernel will send the client an ICMP “port unreachable” packet because the server netns has no process listening on port 53. The DNS client will see this and stop waiting for a response, so iptables -L still completes relatively quickly.
⁵ To their credit, the netfilter developers fixed this in nft, the successor of iptables. By default, nft list won’t perform reverse DNS lookups unless given the -N flag.