Andreas Rammhold
@andir
I just ran into a SERVFAIL condition where some records (e.g. google.com) responded with SERVFAIL while others worked fine. A cache flush fixed it. The version is the latest from yesterday. I'll have time to check the logs tomorrow if required :-)
Vladimír Čunát
@vcunat
An A record for google.com failed sometimes?
Andreas Rammhold
@andir
always
from some point in time on
matrixbot
@matrixbot
ondrej Do you have an idea how full your cache was at that time?
Andreas Rammhold
@andir
I can pull up the graphs later. flokli might be able to give us that information. I'm out with only my phone :-)
matrixbot
@matrixbot
ondrej I am watching Little Mole with my kids and I am on my phone ;)
Vladimír Čunát
@vcunat
We merged major changes to master today, and these fixed almost all of the bad responses we knew of. With google.com there's the "problem" that they serve different DNS in different locations...
matrixbot
@matrixbot
ondrej The Debian/Ubuntu packages were already on the vld-refactoring branch, but there were more fixes today. I'll package the updated version today or tomorrow morning...
Florian Klink
@flokli
vcunat, ondrej: these are the graphs: http://i.cubeupload.com/AR5OSj.png
cache.delete shows pretty well when the cache was flushed
the ~10-15 minutes before that look strange
(scroll down first)
Andreas Rammhold
@andir
thanks flokli :-)
Florian Klink
@flokli
np
Andreas Rammhold
@andir
are we missing a cache.size graph?
Florian Klink
@flokli
I think those are all the graphs I had... Let me check
Florian Klink
@flokli
andir: nope, no cache.size
Ondřej Surý
@oerdnj
I am building today's git version, there are some fixes to DNSSEC validation.
Ondřej Surý
@oerdnj
@andir It would be great if you could sift through the logs around the time the failures started to happen.
Did it all happen in one TLD, or randomly across more?
Vladimír Čunát
@vcunat
Hmm, yes, 19:35--19:45 seems to be the range you speak of. It's like it stopped or was paralyzed at the start, as even answer.total drops, which should be the number of replies sent to clients (regardless of success/failure etc.).
Vladimír Čunát
@vcunat
Logs from the start of the period might shed more light on it. It certainly doesn't sound like something affected by the most recent changes we made after the version you used.
Andreas Rammhold
@andir
Well, I've just checked. The earliest is 19:00. But the graphs and some other systems on our side suggest that it started around 16:00. It seems like only the .com domain is affected. I'll give that machine another 300GB of space for logs...
Ondřej Surý
@oerdnj
@andir Just to be sure, you had 1.2.0~20170110 installed, right?
Andreas Rammhold
@andir
yes
1.2.0~20170110-1+0~20170110151612.41+jessie~1.gbp7bff85
Ondřej Surý
@oerdnj
There were some changes in zone cut processing in that version as we found a similar bug in .org processing. Logs with failures would be extremely valuable.
Andreas Rammhold
@andir
I'll have to change our monitoring script so that it detects that issue and stops sending traffic to knot. I'll then add it again and collect logs.
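Such a health check could be sketched roughly as follows (a minimal stdlib-only Python sketch, not the actual script; the probe name, resolver address, and what a caller does on failure are all assumptions): send a DNS A query over UDP and treat RCODE 2 (SERVFAIL) as unhealthy.

```python
import socket
import struct
import secrets

SERVFAIL = 2  # DNS RCODE for a server failure

def build_query(name: str, qtype: int = 1) -> bytes:
    """Build a minimal DNS query: 12-byte header (RD bit set, one
    question), then QNAME labels, QTYPE and QCLASS (IN)."""
    header = struct.pack("!HHHHHH", secrets.randbits(16), 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode()
        for label in name.rstrip(".").split(".")
    )
    return header + qname + b"\x00" + struct.pack("!HH", qtype, 1)

def rcode(response: bytes) -> int:
    """Extract the RCODE from the low 4 bits of the DNS header flags."""
    flags = struct.unpack("!H", response[2:4])[0]
    return flags & 0x000F

def resolver_healthy(server: str, name: str = "example.com",
                     timeout: float = 2.0) -> bool:
    """Probe one name against one resolver; False on SERVFAIL or timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_query(name), (server, 53))
        try:
            reply, _ = s.recvfrom(512)
        except socket.timeout:
            return False
    return rcode(reply) != SERVFAIL
```

A monitoring loop could call `resolver_healthy()` against the kresd instance periodically and pull it out of rotation after a few consecutive failures.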
Ondřej Surý
@oerdnj
@andir do you have more samples of domains that were failing that you can share?
Andreas Rammhold
@andir
That's a brief excerpt of the strange SERVFAILs that I've found.
Ondřej Surý
@oerdnj
@andir Could you check for the first occurrence of dns1.fpt.vn. (or dns2.fpt.vn.) in the log? And whether the SERVFAILs started to happen after that?
You are right, the log is confusing.
@andir And did you just issue cache.clear(), or did you restart the server to make it work again?
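The ordering check Ondřej asks for could be scripted along these lines (a sketch only; the real kresd log format is not assumed, just substring matches on the name and on "SERVFAIL"):

```python
def first_occurrence_order(log_lines, marker="dns1.fpt.vn.",
                           failure="SERVFAIL"):
    """Return True if the first SERVFAIL line appears at or after the
    first line mentioning the marker name, False if it appears before,
    and None if either string never occurs in the log."""
    marker_at = failure_at = None
    for i, line in enumerate(log_lines):
        if marker_at is None and marker in line:
            marker_at = i
        if failure_at is None and failure in line:
            failure_at = i
    if marker_at is None or failure_at is None:
        return None  # can't tell from this log
    return failure_at >= marker_at
```

Fed the relevant slice of the kresd log, a True result would support the suspicion that the failures started after the dns1.fpt.vn. lookup.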
Andreas Rammhold
@andir
First I tried restarting, but that didn't help. I then cleared the cache.
I've no earlier logs for that entry... forgot to copy them... typing on the phone is a pain :/
Ondřej Surý
@oerdnj
@vcunat It looks like rplan is broken
But I have no idea why that might happen
Vladimír Čunát
@vcunat
Why rplan?
(I mean, any particular reason?)
Ondřej Surý
@oerdnj
@vcunat: the subqueries are always the same for failing queries
Vladimír Čunát
@vcunat
I don't think the subqueries belong to them. The [plan] lines in-between are always immediately finished without any log lines.
Andreas Rammhold
@andir
I guess one could add the query id (or some such, or a pointer address) to each line for a specific query?
That would ease the grepping
Ondřej Surý
@oerdnj
Might be; I already spoke to gdemidov about adding a query-id to the logs
Vladimír Čunát
@vcunat
Yes, in practice there are too many queries solved at once.
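Once a per-query id lands in the logs, the grepping discussed above reduces to grouping lines by that tag. A sketch (the `[qid=N]` tag format is hypothetical; kresd logs carried no such tag at the time of this conversation):

```python
import re
from collections import defaultdict

def group_by_qid(lines, pattern=r"\[qid=(\d+)\]"):
    """Group log lines by a hypothetical [qid=N] tag so that all lines
    belonging to one client query can be read together. Lines without
    the tag are ignored."""
    rx = re.compile(pattern)
    groups = defaultdict(list)
    for line in lines:
        m = rx.search(line)
        if m:
            groups[m.group(1)].append(line)
    return dict(groups)
```

This would untangle exactly the problem vcunat describes: with many queries resolved concurrently, interleaved log lines cannot otherwise be attributed to a single query.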
Ondřej Surý
@oerdnj
@andir I have just triggered the build with today's git master. It resolves all the resolution failures we know of. Did the global failure with cache.clear() as the remedy occur again, or was it a one-time thing?
Andreas Rammhold
@andir
Well, I've not fired it up since... I've been sidetracked with other issues and also not very motivated to write an extended monitoring script for kresd yet :-) I'll try to do that tomorrow