These are chat archives for CZ-NIC/knot-resolver

12th
Jan 2017
Ondřej Surý
@oerdnj
Jan 12 2017 08:55
@andir It would be great if you could sift through the logs from around the time the failures started to happen.
Did it all happen in one TLD, or randomly across more?
Vladimír Čunát
@vcunat
Jan 12 2017 09:34
Hmm, yes, 19:35--19:45 seems to be the range you speak of. It looks like it stopped or was paralyzed at the start, as even answer.total drops, which should be the number of replies sent to clients (regardless of success/failure etc.).
Vladimír Čunát
@vcunat
Jan 12 2017 09:42
Logs from the start of the period might shed more light on it. It certainly doesn't sound like something affected by the most recent changes we've made since the version you used.
Andreas Rammhold
@andir
Jan 12 2017 09:43
Well, I've just checked. The earliest is 19:00. But the graphs and some other systems on our side suggest that it started around 16:00. It seems like only the .com domain is affected. I'll give that machine another 300GB of space for logs...
Ondřej Surý
@oerdnj
Jan 12 2017 10:12
@andir Just to be sure, you had 1.2.0~20170110 installed, right?
Andreas Rammhold
@andir
Jan 12 2017 10:12
yes
1.2.0~20170110-1+0~20170110151612.41+jessie~1.gbp7bff85
Ondřej Surý
@oerdnj
Jan 12 2017 10:15
There were some changes in zone cut processing in that version as we found a similar bug in .org processing. Logs with failures would be extremely valuable.
Andreas Rammhold
@andir
Jan 12 2017 10:19
I'll have to change our monitoring script so that it detects that issue and stops sending traffic to knot. I'll then add it back and collect logs.
Ondřej Surý
@oerdnj
Jan 12 2017 10:20
@andir do you have more samples of domains that were failing that you can share?
that's a brief excerpt of strange SERVFAILs that I've found
Ondřej Surý
@oerdnj
Jan 12 2017 10:32
@andir Could you check for the first occurrence of dns1.fpt.vn. (or dns2.fpt.vn.) in the log? And whether the SERVFAILs started to happen after that?
You are right, the log is confusing
@andir And did you just issue cache.clear(), or did you restart the server to make it work again?
Andreas Rammhold
@andir
Jan 12 2017 10:33
First I tried restarting, but that didn't help. I then cleared the cache.
I've no earlier logs for that entry.. forgot to copy them.. typing on the phone is crap :/
Ondřej Surý
@oerdnj
Jan 12 2017 10:34
@vcunat It looks like rplan is broken
But I have no idea why that might happen
Vladimír Čunát
@vcunat
Jan 12 2017 10:35
Why rplan?
(I mean, any particular reason?)
Ondřej Surý
@oerdnj
Jan 12 2017 10:36
@vcunat: the subqueries are always the same for failing queries
Vladimír Čunát
@vcunat
Jan 12 2017 10:42
I don't think the subqueries belong to them. The [plan] lines in-between are always immediately finished without any log lines.
Andreas Rammhold
@andir
Jan 12 2017 10:43
I guess one could add the query id (or such, or some pointer addr) to each line for a specific query?
That would ease the grep'ing
Ondřej Surý
@oerdnj
Jan 12 2017 10:43
might be, I already spoke to gdemidov to add query-id to the logs
Vladimír Čunát
@vcunat
Jan 12 2017 10:44
Yes, in practice there are too many queries solved at once.