Project

General

Profile

Actions

Bug #16143

open

Unbound DNS over TLS resumption issue

Added by mrpops2ko . 19 days ago. Updated 13 days ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
DNS Resolver
Target version:
-
Start date:
Due date:
% Done:

0%

Estimated time:
Plus Target Version:
Release Notes:
Default
Affected Version:
All
Affected Architecture:
All

Description

hi spent more time than i'd be willing to admit trying to resolve this issue and i'm now largely convinced its an issue of pfsense.

I have set up unbound following the guides in pfsense documentation, having it querying upstream to cloudflare (acting mostly as a forwarder with caching) as per documentation.

Unbound has good stats on recursive breakdowns on query speed and a significant portion of time is spent doing 3 way handshakes.

i've read the unbound documentation https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html numerous times, I thought tls-session-ticket-keys might be the solution, but it seems that is only for downstream clients, not upstream cloudflare.

Following that i read the github bugs and found https://github.com/NLnetLabs/unbound/issues/47 which they fixed in 2020 for the TLS resumption issue.

From that thread i reduced the num-threads and associated settings down all down to 1

num-threads: 1
msg-cache-slabs: 1
rrset-cache-slabs: 1
infra-cache-slabs: 1
key-cache-slabs: 1

but none of that resolved the issue.

i've been using a few commands here to check and verify the claims, one being netstat -an | grep '\.853' and doing so shows me that the num-threads are being reduced and it helping.

from the github thread, it mentions that the tls resumption should be every 60 seconds but using this to verify, it doesn't seem to be the case in pfsense. the last column is the seconds an ip was sticky to the port (assumingly reusing the connection for dns queries) and we can see from this output its quite random in seconds for when it changes source port and remakes a tcp connection.

[24.11-RELEASE][admin@pfSense.home.arpa]/var/unbound: sh -c 'last=""; count=0; start=$(date +%s); while :; do now=$(date "+%Y-%m-%d %H:%M:%S"); now_s=$(date +%s); port=$(netstat -an | awk '\''/\.853[[:space:]]+ESTABLISHED/ && $5 ~ /1\.1\.1\.1\.853/ { split($4,a,"."); print a[5] }'\'' | sort -u); if [ "$port" != "$last" ] && [ -n "$port" ]; then dur=$((now_s - start)); count=$((count + 1)); start=$now_s; last=$port; else dur=$((now_s - start)); fi; printf "%s %s %s %s\n" "$now" "$port" "$count" "$dur"; sleep 5; done' 2025-04-17 17:15:46 8301 1 0
2025-04-17 17:15:51 8301 1 5
2025-04-17 17:15:56 19597 2 10
2025-04-17 17:16:01 19597 2 5
2025-04-17 17:16:06 19597 2 10
2025-04-17 17:16:11 19597 2 15
2025-04-17 17:16:16 19597 2 20
2025-04-17 17:16:21 19597 2 25
2025-04-17 17:16:26 54507 3 30
2025-04-17 17:16:31 54507 3 5
2025-04-17 17:16:36 54507 3 10
2025-04-17 17:16:41 54507 3 15
2025-04-17 17:16:46 54507 3 20
2025-04-17 17:16:51 54507 3 25
2025-04-17 17:16:56 56437 4 30
2025-04-17 17:17:01 56437 4 5
2025-04-17 17:17:06 56437 4 10
2025-04-17 17:17:11 56437 4 15
2025-04-17 17:17:16 56437 4 20
2025-04-17 17:17:21 56437 4 25
2025-04-17 17:17:26 56437 4 30
2025-04-17 17:17:31 56437 4 35
2025-04-17 17:17:36 56437 4 40
2025-04-17 17:17:41 56437 4 45
2025-04-17 17:17:46 56437 4 50
2025-04-17 17:17:51 56437 4 55
2025-04-17 17:17:56 56437 4 60
2025-04-17 17:18:01 56437 4 65
2025-04-17 17:18:06 56437 4 70
2025-04-17 17:18:11 56437 4 75
2025-04-17 17:18:17 56437 4 81
2025-04-17 17:18:22 56437 4 86
2025-04-17 17:18:27 56437 4 91
2025-04-17 17:18:32 7878 5 96
2025-04-17 17:18:37 7878 5 5
2025-04-17 17:18:42 7878 5 10
2025-04-17 17:18:47 45834 6 15
2025-04-17 17:18:52 45834 6 5
2025-04-17 17:18:57 45834 6 10
2025-04-17 17:19:02 45834 6 15
2025-04-17 17:19:07 60415 7 20
2025-04-17 17:19:12 60415 7 5
2025-04-17 17:19:17 60415 7 10
2025-04-17 17:19:22 60415 7 15
2025-04-17 17:19:27 60415 7 20
2025-04-17 17:19:32 60415 7 25
2025-04-17 17:19:37 60415 7 30
2025-04-17 17:19:42 60415 7 35
2025-04-17 17:19:47 7 40
2025-04-17 17:19:52 27369 8 45
2025-04-17 17:19:57 27369 8 5
2025-04-17 17:20:02 57676 9 10
2025-04-17 17:20:07 57676 9 5
2025-04-17 17:20:12 57676 9 10
2025-04-17 17:20:17 57676 9 15
2025-04-17 17:20:22 57676 9 20
2025-04-17 17:20:27 57676 9 25
2025-04-17 17:20:32 57676 9 30
2025-04-17 17:20:37 63027 10 35
2025-04-17 17:20:42 56027 11 5
2025-04-17 17:20:47 56027 11 5
2025-04-17 17:20:52 56027 11 10
2025-04-17 17:20:57 56027 11 15
2025-04-17 17:21:02 56027 11 20
2025-04-17 17:21:07 56027 11 25
2025-04-17 17:21:12 56027 11 30
2025-04-17 17:21:17 56027 11 35
2025-04-17 17:21:22 56027 11 40
2025-04-17 17:21:27 56027 11 45
2025-04-17 17:21:32 56027 11 50
2025-04-17 17:21:37 56027 11 55
2025-04-17 17:21:42 56027 11 60
2025-04-17 17:21:47 56027 11 65
2025-04-17 17:21:52 56027 11 70
2025-04-17 17:21:57 56027 11 75
2025-04-17 17:22:02 56027 11 80
2025-04-17 17:22:07 12305 12 85
2025-04-17 17:22:12 12305 12 5
2025-04-17 17:22:17 12305 12 10
2025-04-17 17:22:22 12305 12 15
2025-04-17 17:22:27 12305 12 20
2025-04-17 17:22:32 33067 13 25
2025-04-17 17:22:37 33067 13 5
2025-04-17 17:22:42 33067 13 10
2025-04-17 17:22:47 33067 13 15
2025-04-17 17:22:52 1845 14 20
2025-04-17 17:22:57 1845 14 5
2025-04-17 17:23:02 1845 14 10
2025-04-17 17:23:07 1845 14 15
2025-04-17 17:23:12 1845 14 20
2025-04-17 17:23:17 1845 14 25
2025-04-17 17:23:22 1845 14 30
2025-04-17 17:23:28 1845 14 36
2025-04-17 17:23:33 1845 14 41
2025-04-17 17:23:38 1845 14 46
2025-04-17 17:23:43 1845 14 51
2025-04-17 17:23:48 1845 14 56
2025-04-17 17:23:53 1845 14 61
2025-04-17 17:23:58 1845 14 66
2025-04-17 17:24:03 1845 14 71
2025-04-17 17:24:08 1845 14 76
2025-04-17 17:24:13 1845 14 81
2025-04-17 17:24:18 1845 14 86
2025-04-17 17:24:23 1845 14 91
2025-04-17 17:24:28 1845 14 96
2025-04-17 17:24:33 1845 14 101
2025-04-17 17:24:38 1845 14 106
2025-04-17 17:24:43 1845 14 111
2025-04-17 17:24:48 1845 14 116
2025-04-17 17:24:53 1845 14 121
2025-04-17 17:24:58 42876 15 126
2025-04-17 17:25:03 42876 15 5
2025-04-17 17:25:08 42876 15 10
2025-04-17 17:25:13 42876 15 15
2025-04-17 17:25:18 42876 15 20
2025-04-17 17:25:23 42876 15 25
2025-04-17 17:25:28 42876 15 30
2025-04-17 17:25:33 42876 15 35
2025-04-17 17:25:38 4723 16 40
2025-04-17 17:25:43 4723 16 5
2025-04-17 17:25:48 4723 16 10
2025-04-17 17:25:53 4723 16 15
2025-04-17 17:25:58 4723 16 20
2025-04-17 17:26:03 32999 17 25
2025-04-17 17:26:08 61083 18 5
2025-04-17 17:26:13 24879 19 5
2025-04-17 17:26:18 24879 19 5
2025-04-17 17:26:23 24879 19 10
2025-04-17 17:26:28 36799 20 15
2025-04-17 17:26:33 61935 21 5
2025-04-17 17:26:38 61935 21 5
2025-04-17 17:26:43 61935 21 10
2025-04-17 17:26:48 61935 21 15
2025-04-17 17:26:53 61935 21 20
2025-04-17 17:26:58 61935 21 25
2025-04-17 17:27:03 61935 21 30
2025-04-17 17:27:08 61935 21 35
2025-04-17 17:27:13 61935 21 40
2025-04-17 17:27:18 61935 21 45
2025-04-17 17:27:23 61935 21 50
2025-04-17 17:27:28 61935 21 55
2025-04-17 17:27:33 61935 21 60
2025-04-17 17:27:38 61935 21 65
2025-04-17 17:27:43 27449 22 70
2025-04-17 17:27:48 27449 22 5
2025-04-17 17:27:53 27449 22 10
2025-04-17 17:27:58 27449 22 15
2025-04-17 17:28:03 2059 23 20
2025-04-17 17:28:08 2059 23 5
2025-04-17 17:28:13 35989 24 10
2025-04-17 17:28:18 35989 24 5
2025-04-17 17:28:24 61185 25 11
2025-04-17 17:28:29 61185 25 5
2025-04-17 17:28:34 61185 25 10
2025-04-17 17:28:39 61185 25 15
2025-04-17 17:28:44 61185 25 20
2025-04-17 17:28:49 61185 25 25
2025-04-17 17:28:54 61185 25 30
2025-04-17 17:28:59 61185 25 35
2025-04-17 17:29:04 61185 25 40
2025-04-17 17:29:09 61185 25 45
2025-04-17 17:29:14 61185 25 50
2025-04-17 17:29:19 61185 25 55
2025-04-17 17:29:24 61185 25 60
2025-04-17 17:29:29 61185 25 65
2025-04-17 17:29:34 61185 25 70
2025-04-17 17:29:39 61185 25 75
2025-04-17 17:29:44 61185 25 80
2025-04-17 17:29:49 16412 26 85
2025-04-17 17:29:54 16412 26 5
2025-04-17 17:29:59 16412 26 10

additionally i'm using this command to get some metrics on how things are performing.

@[24.11-RELEASE][]/var/unbound: sh c 'unbound-control -c /var/unbound/unbound.conf stats | tee /tmp/unbound.stats.tmp | awk '\''BEGIN { sections["Query Totals & Cache"] = "^total\\.num\\.(queries|cache|prefetch|expired|recursivereplies|queries_timed_out)"; sections["Query Types"] = "^num\\.query\\.type"; sections["Transport Protocols"] = "^num\\.query\\.(tcp|udp|tls|https|ipv6)"; sections["DNSSEC Validation"] = "^num\\.answer\\.(secure|bogus)|^num\\.rrset\\.bogus"; sections["Response Codes"] = "^num\\.answer\\.rcode(\\.|$)|^num\\.answer\\.rcode\\.nodata"; sections["Errors / Threats"] = "^(unwanted|ratelimited|queries_timed_out|total\\.num\\.queries_timed_out)"; sections["Memory & Cache Stats"] = "^mem\\.cache|^cache\\.(count|max_collisions)"; } { for (s in sections) { if ($0 ~ sections[s]) { split($0, kv, "="); if (kv2 + 0 > 0) { if (!(s in seen)) { print "\n\033[1;34m== " s " \033[0m"; seen[s] = 1; } printf "\033[0;36m%-35s:\033[0;33m -10s\033[0m\n", kv[1], kv[2]; } } } }'\''; echo ""; echo -e "\033[1;34m Response Summary \033[0m"; noerror=$(grep "^num\.answer\.rcode\.NOERROR=" /tmp/unbound.stats.tmp | cut -d= -f2); nxdomain=$(grep "^num\.answer\.rcode\.NXDOMAIN=" /tmp/unbound.stats.tmp | cut -d= -f2); servfail=$(grep "^num\.answer\.rcode\.SERVFAIL=" /tmp/unbound.stats.tmp | cut -d= -f2); nodata=$(grep "^num\.answer\.rcode\.nodata=" /tmp/unbound.stats.tmp | cut -d= -f2); totalresp=$(echo "$noerror + $nxdomain + $servfail + $nodata" | bc); printf "\033[0;36mTotal responses:\033[0;33m %s\033[0m\n" "$totalresp"; printf "\033[0;36mNOERROR:\033[0;33m %.2f%\033[0m\n" $(echo "$noerror * 100 / $totalresp" | bc -l); printf "\033[0;36mNXDOMAIN:\033[0;33m .2f%\033[0m\n" $(echo "$nxdomain * 100 / $totalresp" | bc -l); printf "\033[0;36mSERVFAIL:\033[0;33m .2f%\033[0m\n" $(echo "$servfail * 100 / $totalresp" | bc -l); printf "\033[0;36mNODATA:\033[0;33m .2f%\033[0m\n" $(echo "$nodata * 100 / $totalresp" | bc -l); echo ""; echo -e "\033[1;34m Additional Stats \033[0m"; totalq=$(grep "^total.num.queries=" /tmp/unbound.stats.tmp | cut -d= -f2); cachehits=$(grep "^total.num.cachehits=" /tmp/unbound.stats.tmp | cut -d= -f2); cachemiss=$(grep "^total.num.cachemiss=" /tmp/unbound.stats.tmp | cut -d= -f2); prefetch=$(grep "^total.num.prefetch=" /tmp/unbound.stats.tmp | cut -d= -f2); rec=$(grep "^total.num.recursivereplies=" /tmp/unbound.stats.tmp | cut -d= -f2); nonrec=$(echo "$totalq - $rec" | bc); printf "\033[0;36mTotal queries:\033[0;33m s\033[0m\n" "$totalq"; printf "\033[0;36mCache Hit %:\033[0;33m .2f%\033[0m\n" $(echo "$cachehits * 100 / $totalq" | bc -l); printf "\033[0;36mCache Miss %:\033[0;33m %.2f%\033[0m\n" $(echo "$cachemiss * 100 / $totalq" | bc -l); printf "\033[0;36mPrefetch usage %:\033[0;33m %.2f%\033[0m\n" $(echo "$prefetch * 100 / $totalq" | bc -l); printf "\033[0;36mNon-recursive estimate (fast answers):\033[0;33m s (.2f%%)\033[0m\n" "$nonrec" $(echo "$nonrec * 100 / $totalq" | bc -l); printf "\033[0;36mRecursive rate:\033[0;33m .2f%\033[0m\n" $(echo "$rec * 100 / $totalq" | bc -l); echo ""; echo -e "\033[1;34m Latency Histogram Breakdown (recursive queries only) ==\033[0m"; total=$(grep "^total.num.recursivereplies=" /tmp/unbound.stats.tmp | cut -d= -f2); grep "^histogram\\." /tmp/unbound.stats.tmp | awk -v total="$total" -F"[.=]" '\''{ s=$2*1000000+$3; e=$5*1000000+$6; c=$NF; if (c>0 && s>=8192) { ms1=int(s/1000); ms2=int(e/1000); p=(c/total)*100; cum+=p; printf "\033[0;36m%4d4d ms:\033[0m \033[0;33m%6d\033[0m \033[0;36mqueries\033[0m (\033[0;33m.2f%%\033[0m, \033[0;36mcumulative\033[0m: \033[0;33m%.2f%%\033[0m)\n", ms1, ms2, c, p, cum } }'\''; echo ""'

Query Totals & Cache
total.num.queries : 7569
total.num.cachehits : 6036
total.num.cachemiss : 1533
total.num.prefetch : 1845
total.num.expired : 1516
total.num.recursivereplies : 1533 Memory & Cache Stats
mem.cache.rrset : 471479
mem.cache.message : 462974 Query Types
num.query.type.A : 5370
num.query.type.PTR : 12
num.query.type.MX : 7
num.query.type.AAAA : 1627
num.query.type.SRV : 2
num.query.type.SVCB : 12
num.query.type.HTTPS : 539 Transport Protocols
num.query.tcpout : 104 Response Codes
num.answer.rcode.NOERROR : 7501
num.answer.rcode.SERVFAIL : 24
num.answer.rcode.NXDOMAIN : 44
num.answer.rcode.nodata : 1507 Response Summary
Total responses: 9076
NOERROR: 82.65%
NXDOMAIN: 0.48%
SERVFAIL: 0.26%
NODATA: 16.60% Additional Stats
Total queries: 7569
Cache Hit : 79.75
Cache Miss : 20.25
Prefetch usage : 24.38
Non-recursive estimate (fast answers): 6036 (79.75%)
Recursive rate: 20.25% Latency Histogram Breakdown (recursive queries only)
8- 16 ms: 269 queries (17.55%, cumulative: 17.55%)
16- 32 ms: 205 queries (13.37%, cumulative: 30.92%)
32- 65 ms: 345 queries (22.50%, cumulative: 53.42%)
65- 131 ms: 375 queries (24.46%, cumulative: 77.89%)
131- 262 ms: 208 queries (13.57%, cumulative: 91.45%)
262- 524 ms: 48 queries (3.13%, cumulative: 94.59%)
524-1000 ms: 6 queries (0.39%, cumulative: 94.98%)
1000-2000 ms: 3 queries (0.20%, cumulative: 95.17%)@

ideally i'd be expecting or it would be my assumption that if the TCP connection was being reused properly with TLS resumption, then a good 80% of connections would fall into the first two categories.

When doing this testing, i reduced the settings i mentioned and reduced the upstream DNS server to a single cloudflare 1.1.1.1 server. Doing so has improved them, but its not nearly what I would have expected.

The following below stats are from when I used both 1.1.1.1 and 1.0.0.1 alongside 4 num-threads and associated settings.

@== Query Totals & Cache
total.num.queries : 429496
total.num.cachehits : 418166
total.num.cachemiss : 11330
total.num.prefetch : 115537
total.num.expired : 91504
total.num.recursivereplies : 11326

Memory & Cache Stats
mem.cache.rrset : 2062702
mem.cache.message : 2301061 Query Types
num.query.type.A : 341294
num.query.type.CNAME : 36
num.query.type.SOA : 2
num.query.type.PTR : 276
num.query.type.MX : 347
num.query.type.TXT : 60
num.query.type.AAAA : 70299
num.query.type.SRV : 36
num.query.type.SVCB : 523
num.query.type.HTTPS : 16623 Transport Protocols
num.query.tcp : 3
num.query.tcpout : 36730 Response Codes
num.answer.rcode.NOERROR : 425422
num.answer.rcode.SERVFAIL : 1695
num.answer.rcode.NXDOMAIN : 2375
num.answer.rcode.nodata : 58055 Response Summary
Total responses: 487547
NOERROR: 87.26%
NXDOMAIN: 0.49%
SERVFAIL: 0.35%
NODATA: 11.91% Additional Stats
Total queries: 429496
Cache Hit : 97.36
Cache Miss : 2.64
Prefetch usage : 26.90
Non-recursive estimate (fast answers): 418170 (97.36%)
Recursive rate: 2.64% Latency Histogram Breakdown (recursive queries only) ==
8- 16 ms: 1453 queries (12.83%, cumulative: 12.83%)
16- 32 ms: 1343 queries (11.86%, cumulative: 24.69%)
32- 65 ms: 2507 queries (22.13%, cumulative: 46.82%)
65- 131 ms: 3297 queries (29.11%, cumulative: 75.93%)
131- 262 ms: 1560 queries (13.77%, cumulative: 89.71%)
262- 524 ms: 289 queries (2.55%, cumulative: 92.26%)
524-1000 ms: 35 queries (0.31%, cumulative: 92.57%)
1000-2000 ms: 23 queries (0.20%, cumulative: 92.77%)@
Actions #1

Updated by mrpops2ko . 18 days ago

i've crossposted it to unbound github too https://github.com/NLnetLabs/unbound/issues/1271

Actions #2

Updated by mrpops2ko . 13 days ago

further to this it turns out its a setting which needed tweaking

specifically;
infra-cache-min-rtt: 750

but i've also included

edns-tcp-keepalive: yes
edns-tcp-keepalive-timeout: 180000
max-reuse-tcp-queries: 9000

which helps. i think some of these settings should be mentioned in the pfsense documentation for unbound as with the default settings it will cause connections to constantly drop and then you face the 3 way handshake over and over again

Actions

Also available in: Atom PDF