These are chat archives for CZ-NIC/knot-resolver

12th
Dec 2018
micah
@micah
Dec 12 2018 14:41
I'm running the 3.1.0-1 package from the knot-resolver.cz/download page, and for some reason it stopped resolving last night, is there something I can do to query to find out why?
I'm getting a SERVFAIL when I dig @127.0.01 <domain>
Petr Špaček
@pspacek
Dec 12 2018 14:42
Hmm, very weird. Do you have verbose logs enabled?
micah
@micah
Dec 12 2018 14:42
i do not
Petr Špaček
@pspacek
Dec 12 2018 14:42
Okay, let me think. Do not restart the resolver ...
micah
@micah
Dec 12 2018 14:42
i have two of them, and both are in this state
Petr Špaček
@pspacek
Dec 12 2018 14:42
That's even weirder!
Command systemctl status kresd-control@1 should print (among other things) print path to socket somewhere in /var/run/... Can you get that path?
micah
@micah
Dec 12 2018 14:44
got that
Petr Špaček
@pspacek
Dec 12 2018 14:44
It is most likely systemctl status kresd-control@1.socket
Ok
Now make sure you have command socat installed.
micah
@micah
Dec 12 2018 14:44
Listen: /run/knot-resolver/control@1 (Stream)
Petr Špaček
@pspacek
Dec 12 2018 14:44
yeah, that's it
micah
@micah
Dec 12 2018 14:45
ok, i've got socat
Petr Špaček
@pspacek
Dec 12 2018 14:45
Fine. Command socat - unix-connect:/run/knot-resolver/control@1 will give you interactive console for the running process.
You should be able to type in verbose(true) and press enter. That will enable verbose logging.
After that journalctl -f kresd@1 should give us more information. (Mistake, it should be journalctl -f -u kresd@1)
micah
@micah
Dec 12 2018 14:46
yep, was able to do that
Petr Špaček
@pspacek
Dec 12 2018 14:46
I would recommend to run journalctl and then do a problematic DNS query.
Then you can copy&paste the vebose log here.
micah
@micah
Dec 12 2018 14:47
hmm, journalctl -f kresd@1 tells me Failed to add match 'kresd@1': Invalid argument
Petr Špaček
@pspacek
Dec 12 2018 14:47
My bad, it should have been journalctl -f -u kresd@1
(I.e. -u was missing.)
micah
@micah
Dec 12 2018 14:48
ok, I do a query, get a SERVFAIL but I'm not seeing anything in the log
Petr Špaček
@pspacek
Dec 12 2018 14:48
Do you have more than 1 kresd instance running on the machine?
micah
@micah
Dec 12 2018 14:48
oops, I just got a bunch
Petr Špaček
@pspacek
Dec 12 2018 14:49
Maybe some buffering in journal delays messages or something, I'm not sure.
micah
@micah
Dec 12 2018 14:49
but they are queries from clients it seems
no, just one instance running
Petr Špaček
@pspacek
Dec 12 2018 14:49
You definitelly must see the localhost. query in there. If it is not there but other traffic is, well, then I suspect you are sending query to different place than you believe :-D
micah
@micah
Dec 12 2018 14:50
i'm doing dig @127.0.0.1 google.com
It might be in the log, its just a big bunch that comes all at once
Petr Špaček
@pspacek
Dec 12 2018 14:51
Then go for journalctl -f -u kresd@1 > /tmp/log and paste the file here.
micah
@micah
Dec 12 2018 14:51
i can paste the log here, but its a bit long
Petr Špaček
@pspacek
Dec 12 2018 14:52
Or use pastebin, as you wish.
micah
@micah
Dec 12 2018 14:54
trying to find one that will let me upload that size
(205k)
Petr Špaček
@pspacek
Dec 12 2018 14:54
Omg, that's A LOT for one query.
micah
@micah
Dec 12 2018 14:55
there are a lot of clients trying to use this resolver right now (and failing)
Petr Špaček
@pspacek
Dec 12 2018 14:55
How many clients do you have?
I see.
micah
@micah
Dec 12 2018 14:55
not a huge amount, but a mail server
which does a lot of queries
lots of AD: request NOT classified as SECURE
Petr Špaček
@pspacek
Dec 12 2018 14:56
Very weird. Can you paste your resolver config as well, please?
I'm more worried about lines which are missing - there are no lines with attempts to actually contact anyone.
micah
@micah
Dec 12 2018 14:58
cache.size = 4 * GB

-- Load Useful modules
modules = {
    'policy',   -- Block queries to local zones/bad sites
    'view',     -- Views for certain clients
    'hints > iterate',    -- Load /etc/hosts and allow custom root hints
    'stats',    -- Track internal statistics
}

view:addr('198.252.153.0/24', function (req, qry) return policy.PASS end)
view:addr('204.13.164.0/24', function (req, qry) return policy.PASS end)
view:addr('199.254.238.0/24', function (req, qry) return policy.PASS end)
view:addr('199.58.80.128/25', function (req, qry) return policy.PASS end)
view:addr('199.58.81.137/25', function (req, qry) return policy.PASS end)
view:addr('10.0.1.0/24', function (req, qry) return policy.PASS end)
view:addr('10.0.2.0/24', function (req, qry) return policy.PASS end)
view:addr('127.0.0.1', function (req, qry) return policy.PASS end)
--- Drop everything that hasn't matched
view:addr('0.0.0.0/0', function (req, qry) return policy.DROP end)
(i'm allowing my netblocks to do queries, but nowhere else)
on the second machine where this is happening, I issued a systemctl restart kresd@1 and now I can do queries against it
but I have not restarted this one, so we can continue to debug
Petr Špaček
@pspacek
Dec 12 2018 15:00
Great, thanks!
Okay, please connect to the control socket again, we will try to gather more information.
micah
@micah
Dec 12 2018 15:04
ok
Petr Špaček
@pspacek
Dec 12 2018 15:04
Once you are connected please run stats.upstreams(), it will print a table.
micah
@micah
Dec 12 2018 15:05
it does
Petr Špaček
@pspacek
Dec 12 2018 15:05
Can you paste it here?
(Or elsewhere.)
micah
@micah
Dec 12 2018 15:06
having trboule hitting enter here
i seem to get into edit mode in the chat, and cannot hit enter to send the paste
Petr Špaček
@pspacek
Dec 12 2018 15:07
Maybe it's too big so it refuses the message, just guessing.
Petr Špaček
@pspacek
Dec 12 2018 15:09
Okay, so it was able to communicate with quite a lot of servers for some time...
Out of curiosity, what's output of stats.list() in the kresd console?
Most importantly, do numbers increase between two consecutive runs of stats.list()?
micah
@micah
Dec 12 2018 15:11
[answer.nxdomain] => 15497
[answer.100ms] => 3711
[answer.1500ms] => 31
[answer.slow] => 855
[answer.servfail] => 7546677
[answer.250ms] => 1095
[answer.cached] => 362
[answer.nodata] => 314
[query.edns] => 7284979
[query.dnssec] => 7284932
[answer.total] => 7565696
[answer.10ms] => 2137
[answer.noerror] => 3009
[answer.50ms] => 10479
[answer.500ms] => 402
[answer.1000ms] => 89
[answer.1ms] => 456

>
the servfail number is increasing
and the edns one
and dnssec
and total
Petr Špaček
@pspacek
Dec 12 2018 15:12
Okay, so at least the module machinery is working as expected.
Let me think what else we can gather from this.
micah
@micah
Dec 12 2018 15:13
otherwise they stay the same
Petr Špaček
@pspacek
Dec 12 2018 15:13
So answer.cached does not increase, can you confirm that?
micah
@micah
Dec 12 2018 15:14
correct
Petr Špaček
@pspacek
Dec 12 2018 15:14
Okay, let me dig deeper.
micah
@micah
Dec 12 2018 15:14
its still at 362 and I did a few more queries
Petr Špaček
@pspacek
Dec 12 2018 15:15
It's very interesting, it has to fail quite soon after start (noerror numbers are small).
micah
@micah
Dec 12 2018 15:15
indeed, the other server that I restarted to get it working again has already started to fail
Petr Špaček
@pspacek
Dec 12 2018 15:16
Wow, we never ever saw that!
micah
@micah
Dec 12 2018 15:16
but if I restart it, it works again
Petr Špaček
@pspacek
Dec 12 2018 15:17
If it is reproducible so quickly, please add verbose(true) to your config file and restart it - it would be super useful to get logs between daemon start and beginning of the failure mode.
@micah My apologies for this!
micah
@micah
Dec 12 2018 15:17
is that in the config file, or on the console?
Petr Špaček
@pspacek
Dec 12 2018 15:18
For this purpose to beginning of config file so its applied right after start.
micah
@micah
Dec 12 2018 15:18
its ok! I appreciate the debugging help!
Petr Špaček
@pspacek
Dec 12 2018 15:18
But technically console and config file are the very same - they are basically Lua script, console is just interactive version of it.
micah
@micah
Dec 12 2018 15:19
ok, i just added verbose(true) and restarted. I see a lot more query info than the log I pasted before when things stopped
Petr Špaček
@pspacek
Dec 12 2018 15:25
I suspect that once its starts failing it will produce the same terse output.
micah
@micah
Dec 12 2018 15:27
ok, we are at SERVFAIL
and yes, were are getting the terse output
Petr Špaček
@pspacek
Dec 12 2018 15:28
Okay, please paste some reasonable part of the log somewhere. We are not interested in what follows after first SERVFAIL/terse output.
micah
@micah
Dec 12 2018 15:28
I don't see any obvious message between when it was giving a lot of messages, vs. the terse messages
Petr Špaček
@pspacek
Dec 12 2018 15:29
It might be somewhere further before, e.g. some error in autoritative server reputation logic/connection management or something.
Once you have the log uploaded somewhere you can try this in kresd console (when it SERVFAILs):
for idx, name in pairs(modules.list()) do if name ~= 'iterate' then modules.unload(name) end end
Petr Špaček
@pspacek
Dec 12 2018 15:32
It will unload all modules except basic iterate module. I'm curious if it will start working or not.
In any case it will limit the are where we have to look for problem.
micah
@micah
Dec 12 2018 15:32
just tried that, it did not start working
Petr Špaček
@pspacek
Dec 12 2018 15:33
Okay, so feel free to restart the resolver to get back to /normal/ state.
micah
@micah
Dec 12 2018 15:36
i'll note that the log I uploaded isn't from the beginning of when the server started
but as far back as that journal entry went
Petr Špaček
@pspacek
Dec 12 2018 15:37
Interesting. For some reason the log is cluttered with usable NS addresses: 0/1 or similar, so it seems kresd is not able to communicate with authoritative servers and it accumulates so long that it fails completely.
micah
@micah
Dec 12 2018 15:37
i noticed those too
Petr Špaček
@pspacek
Dec 12 2018 15:39
Okay, I would try to append cache.ns_tout(100) to your config file and restart the resolver again. Let's see what happens if we lower auth retry timeout to 100 ms.
micah
@micah
Dec 12 2018 15:40
ok
micah
@micah
Dec 12 2018 15:46
these AD: request NOT classified as SECURE are nothing to be worried about?
Petr Špaček
@pspacek
Dec 12 2018 15:46
That's fine, that's result of DNSSEC validation - normal for domains which are not DNSSEC-signed.
micah
@micah
Dec 12 2018 15:46
i'm seeing these now, not sure they were there before [11245]: [ ][nsre] probing timeouted NS: 50.63.202.67, score 2000
(strange they don't have the [] filled in)
i'm in SERVFAIL state again
Petr Špaček
@pspacek
Dec 12 2018 15:49
Well, IP 50.63.202.67 does not reply to my DNS queries either ...
Maybe try in the kresd console cache.ns_tout(1) and we will see if logs will slow frequent retries or not...
micah
@micah
Dec 12 2018 15:52
ok, i set that
Petr Špaček
@pspacek
Dec 12 2018 15:52
Is there any /new/ activity in the log?
micah
@micah
Dec 12 2018 15:52
it looks the same after SERVFAIL
Petr Špaček
@pspacek
Dec 12 2018 15:55
I'm looking into the code and it seems that usable NS addresses: X/Y message when we decrease number of working authoritative IP addresses. Interesting ...
micah
@micah
Dec 12 2018 16:05
it seems to take between 5-10minutes when things stop working
Petr Špaček
@pspacek
Dec 12 2018 16:06
Colleague of mine is building 3.0 packages for you to test so we can verify if it is a new problem in 3.1.0 ...
micah
@micah
Dec 12 2018 16:08
i'll be back in ~20minutes and can provide any further debug/test
Petr Špaček
@pspacek
Dec 12 2018 16:09
Thank you!
Petr Špaček
@pspacek
Dec 12 2018 16:22
Okay, so we finally noticed the log is not complete - journald apparently has some rate limiting enabled by default ... :-)
Anyway, once you are here please run cache.clear() in kresd console, I'm curious if it starts working after that.
micah
@micah
Dec 12 2018 16:40
ok, i'll try it
micah
@micah
Dec 12 2018 16:47
@pspacek ok, I waited for SERVFAIL state, went to the console and did cache.clear() and it told me [count] => 62909 but didn't start to work
Petr Špaček
@pspacek
Dec 12 2018 16:48
Very interesting, that rules out some of possible sources of problems.
I wonder how to get even more information ... Are you familiar with GDB debugger?
micah
@micah
Dec 12 2018 16:51
i am, I know how to do basic things with GDB, mostly getting a backtrace when something segfaults
Petr Špaček
@pspacek
Dec 12 2018 16:52
Okay, I will think of ways how to get more information in a reasonable way.
micah
@micah
Dec 12 2018 17:08
I'm using rbldsnd for a RBL query, and it has a bunch of replies that are in the .dnsbl TLD (not a legit TLD). I want to set that as a negative trust anchor, so DNSSEC validation is not attempted with it, can I do that with trust_anchors.negative = { 'dnsbl' }
the example on https://knot-resolver.readthedocs.io/en/stable/daemon.html?highlight=negative%20trust%20anchor#trust-anchors-and-dnssec only talks about domain.tld, so I'm unsure if I put dnsbl it will be treated as a TLD
Petr Špaček
@pspacek
Dec 12 2018 17:09
Yes, that should work.
micah
@micah
Dec 12 2018 17:10
maybe unrelated to the issue of things failing, but I do have to query a lot of RBLs to check for spammers, and they kind of 'abuse' the DNS system by returning 127.x.x.x responses as signaling and if an IP is not listed it returns a NXDOMAIN
Petr Špaček
@pspacek
Dec 12 2018 17:10
Well, hopefully, it might theoretically still conflict with agressive proof of nonexistence. @vcunat Do you remember if NTA works for fake TLDs?
@micah In meanwhile please try to add net.ipv6 = false to your config, it will eliminate some unsucessful attempts to use IPv6 (which is, I guess from logs, not working on your system).
micah
@micah
Dec 12 2018 17:13
indeed, there is no ipv6 address configured on the interfaces
Petr Špaček
@pspacek
Dec 12 2018 17:14
Okay, you can try this to see if it makes any difference. After that you can download 3.0.0 packages from https://software.opensuse.org//download.html?project=home%3ACZ-NIC%3Aknot-resolver-testing&package=knot-resolver (testing repo) to see if it works better than 3.1.0 or not.
micah
@micah
Dec 12 2018 17:15
ok, I've changed the net.ipv6 = false and restarted, and am waiting for failure
Petr Špaček
@pspacek
Dec 12 2018 17:16
Thank you for your time!
micah
@micah
Dec 12 2018 17:16
in the meantime, the trust_anchors.negative { 'dnsbl' } didn't seem to work. I still get NXDOMAIN in response from them
actually, ignore that. I need to set something up still to confirm that ^
micah
@micah
Dec 12 2018 17:21
it seems I need to actually forward all requests for the TLD dnsbl to the dnsbl server
and I probably need to set the negative trust anchor
Petr Špaček
@pspacek
Dec 12 2018 17:21
See the link above, it has configuration snippet you need.
micah
@micah
Dec 12 2018 17:26
this extraTrees = policy.todnames({'libre', 'null'}) <-- I'd replace libre here with the actual dns names? And can I list multiple here in an array?
Vladimír Čunát
@vcunat
Dec 12 2018 17:30
Oops, 'null' could be mistaken as a list terminator in the example but it's actual TLD of OpenNIC :-)
Petr Špaček
@pspacek
Dec 12 2018 17:30
So yes, you can list multiple names or just one in there.
micah
@micah
Dec 12 2018 17:31
err, oh! does that mean that 'libre', 'null' are two different names?
Petr Špaček
@pspacek
Dec 12 2018 17:32
Yes, they are!
I will polish the example in docs to make that clearer.
micah
@micah
Dec 12 2018 17:33
interestingly... since disabling ipv6, I have not had a SERVFAIL yet
and it has been 20 minutes
Petr Špaček
@pspacek
Dec 12 2018 17:33
Really? That would be a breakthough!
micah
@micah
Dec 12 2018 17:34
yeah, was last restarted at :13, and it is now :33 and I can still query it
Petr Špaček
@pspacek
Dec 12 2018 17:34
Maybe you can check stats.list() as you did before to make sure majority of queries is being replied with non-SERVFAIL responses.
micah
@micah
Dec 12 2018 17:35
would be nice to have readline functionality :)
Petr Špaček
@pspacek
Dec 12 2018 17:35
Well, it even has web interface with interactive graphs :-D
This socket interface is mainly for scripting.
Vladimír Čunát
@vcunat
Dec 12 2018 17:37
What's rough request rate for the resolver (requests per second)?
Petr Špaček
@pspacek
Dec 12 2018 17:51
@micah I have to leave now, thank you very much for your patience. We will look into this and see what broke in 3.1.
In case you want to experiment it would be valuable to try 3.0 without net.ipv6 = false to see how it behaves...
Anyway, thank you once again and see you later!
micah
@micah
Dec 12 2018 18:00
ok! I will try that, the net.ipv6 setting seems to be keeping it alive
micah
@micah
Dec 12 2018 18:37
it seems like 3.0 is working, I do not have net.ipv6 = false configured on this machine and it has been running for 30minutes without failing
with 3.1, I can only keep things running for 5-10 minutes, unless I set net.ipv6 = false
Ghost
@ghost~5c11633cd73408ce4fb1c1a4
Dec 12 2018 19:46

Hello guys new to knot resolver. I've a small problem which i couldn't figure it out yet. What i want to do is enforce forcesafesearch for search engines and youtube , but according to the docs its still not yet possible to do it with rpz. I managed using

policy.add(policy.suffix(policy.FORWARD('127.0.0.2'), {todname('google.com')})) , where 127.0.0.2 is another resolver with forcesafesearch configured properly. Is there any other way to implement this in knot ? so that i don't have to run/forward to another resolver to accomplish this.

Petr Špaček
@pspacek
Dec 12 2018 20:52
@micah Excellent job, thank you very much for cooperation! We will look into this before 3.2 release.
micah
@micah
Dec 12 2018 21:39
@pspacek I would guess that a simple setup of 3.1 on a debian stretch system without ipv6 should be able to reproduce this
However, I'm happy to provide more details or help debugging if you need it... but until then, I'll continue with my project to replace bind :)