These are chat archives for CZ-NIC/knot-resolver

7th
Feb 2019
Petr Špaček
@pspacek
Feb 07 08:57
@micah Hmm, that needs further investigation. First of all, make sure you have the very latest version (3.2.1); there were important fixes in that release.
If you are on the latest version, then we will need a complete verbose log starting before the issue occurred, so we can see whether anything obviously bad happened in the preceding activity.
micah
@micah
Feb 07 12:30
@pspacek definitely have 3.2.1-1 from your repos.
These resolvers are being used by a very active mail server, sending and receiving a million emails a day, so a verbose log may be... challenging
It's definitely happening again right now. One of the servers takes about 10 seconds before responding, then gives SERVFAIL; the other responds right away with NXDOMAIN (the correct response)
Petr Špaček
@pspacek
Feb 07 12:34
This is very hard to debug without more information, sorry.
micah
@micah
Feb 07 12:35
This sounds exactly like the problem I had with 'knot'
Petr Špaček
@pspacek
Feb 07 12:35
At the very least, you can enable verbose logging on the "broken" instance so we can see what is happening there.
micah
@micah
Feb 07 12:35
The problem (with knot-dns) was 'solved' with me disabling ipv6
Petr Špaček
@pspacek
Feb 07 12:35
If we are lucky, we will not need a verbose log from the past...
micah
@micah
Feb 07 12:35
so I could try to do that and see if it is the same issue
Petr Špaček
@pspacek
Feb 07 12:36
The code bases are different so I'm not really sure ... but in any case, if you do not have functional IPv6 you should have net.ipv6 = false in your kresd config.
micah
@micah
Feb 07 12:37
I do have functional ipv6... I'll try net.ipv6 = false and see if it makes the problem go away or not
I have to go afk for a few hours, so I've set that and will see the results when I return. I can do more debugging then
Vladimír Čunát
@vcunat
Feb 07 12:41
If you know the names, it would be possible to look at those first through DnsViz.net. Sometimes it happens, e.g., that some of the authoritative servers get some answers wrong.
(But of course, the problem may be at knot-resolver side, too.)
micah
@micah
Feb 07 16:47
ok, I tried running with net.ipv6 = false but it did not solve the problem. An example query would be: dig @10.0.1.175 148.18.173.131.truncate.gbudb.net
which at the moment, produces a SERVFAIL
when I restart kresd, then I get a NXDOMAIN (which is expected)
I'm not sure what EDNS is, but they seem to be doing it?
Vladimír Čunát
@vcunat
Feb 07 17:56

For me DnsViz shows that the authoritative servers return SERVFAIL sometimes.

148.18.173.131.truncate.gbudb.net/A: The response had an invalid RCODE (SERVFAIL). (185.87.186.181, UDP_-_NOEDNS_)

micah
@micah
Feb 07 17:57
@vcunat yes, I see that also, and it seems it is because of NOEDNS,
though I do not understand what that is
I do agree they are failing here, but does that mean that knotd should return a SERVFAIL for it?
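For readers wondering what "EDNS" versus "NOEDNS" means at the wire level: an EDNS query is an ordinary DNS query with one extra OPT pseudo-record (type 41) in the additional section, while a NOEDNS query omits it. The following is a minimal sketch only; the function names, the query ID, and the 4096-byte payload size are illustrative assumptions and are not taken from knot-resolver or DnsViz.

```python
import struct

def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"

def build_query(name, qtype=1, edns_payload=None, query_id=0x1234):
    """Build a DNS query; append an OPT record when edns_payload is given.

    query_id and edns_payload are illustrative values (assumptions).
    """
    arcount = 1 if edns_payload is not None else 0
    # Header: ID, flags (RD bit set), QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    # Question: name, QTYPE (1 = A), QCLASS (1 = IN)
    msg = header + encode_name(name) + struct.pack("!HH", qtype, 1)
    if edns_payload is not None:
        # OPT record: root name, TYPE 41, CLASS carries the UDP payload size,
        # TTL carries extended RCODE/version/flags (all zero here), RDLENGTH 0
        msg += b"\x00" + struct.pack("!HHIH", 41, edns_payload, 0, 0)
    return msg

plain = build_query("148.18.173.131.truncate.gbudb.net")
with_edns = build_query("148.18.173.131.truncate.gbudb.net", edns_payload=4096)
print(len(plain), len(with_edns))
```

The only difference between the two messages is the 11-byte OPT record and the ARCOUNT field in the header, which is why a test tool can probe a server both ways.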
Vladimír Čunát
@vcunat
Feb 07 18:20
Only one of five servers in the set does this ATM, apparently regardless of what question is asked (or whether EDNS is sent). Knot-resolver didn't hit that address for me in a dozen retries, but IIRC there is some related sub-optimality in the algorithm choosing which IP to ask.
micah
@micah
Feb 07 18:27
like maybe knot, when it hits that bad server, should try another instead?
Vladimír Čunát
@vcunat
Feb 07 18:42
Yes, SERVFAILs certainly should lead to retries with other IPs. There might be a larger bug than I thought originally.
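The behaviour described here, retrying with another IP when one authoritative server answers SERVFAIL, can be sketched roughly as follows. This is not knot-resolver's actual server-selection algorithm; `resolve_with_failover`, `fake_ask`, and the 192.0.2.x addresses are hypothetical names and values used only for illustration, and real resolvers also weigh timeouts and round-trip times.

```python
SERVFAIL = "SERVFAIL"

def resolve_with_failover(server_ips, ask):
    """Ask each server in turn; return the first answer that is not SERVFAIL.

    `ask` is a caller-supplied function mapping an IP to an RCODE string.
    Returns (ip, rcode), or (None, SERVFAIL) if every server failed.
    """
    for ip in server_ips:
        rcode = ask(ip)
        if rcode != SERVFAIL:
            return ip, rcode
    return None, SERVFAIL

# Simulated authoritative set: one misbehaving server (the address reported
# in the chat) plus hypothetical healthy servers from the documentation range.
tried = []

def fake_ask(ip):
    tried.append(ip)
    return SERVFAIL if ip == "185.87.186.181" else "NXDOMAIN"

result = resolve_with_failover(["185.87.186.181", "192.0.2.1", "192.0.2.2"], fake_ask)
print(result, tried)
```

With the bad server listed first, the sketch moves on to the next address and stops there, so the client sees NXDOMAIN rather than SERVFAIL.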