Vladimír Čunát
@vcunat
We merged major changes today to master, and these fixed almost all of the bad responses we knew of. With google.com there's the "problem" that they serve different DNS on different locations...
matrixbot
@matrixbot
ondrej The Debian/Ubuntu packages were already on the vld-refactoring branch, but there were more fixes today. I'll package the updated version today or tomorrow morning....
Florian Klink
@flokli
vcunat, ondrej: these are the graphs: http://i.cubeupload.com/AR5OSj.png
cache.delete shows pretty clearly when the cache was flushed
the ~10-15 minutes before that look strange
(scroll down first)
Andreas Rammhold
@andir
thanks flokli :-)
Florian Klink
@flokli
np
Andreas Rammhold
@andir
are we missing a cache.size graph?
Florian Klink
@flokli
I think that's all graphs I had... Let me check
Florian Klink
@flokli
andir: nope, no cache.size
Ondřej Surý
@oerdnj
I am building today's git version, there are some fixes to DNSSEC validation.
Ondřej Surý
@oerdnj
@andir It would be great if you could sift through the logs around the time the failures started to happen.
Did it all happen in one TLD, or randomly across more?
Vladimír Čunát
@vcunat
Hmm, yes, 19:35--19:45 seems to be the range you speak of. It looks like the resolver stopped or was paralyzed at the start, as even answer.total drops, which should be the number of replies sent to clients (regardless of success/failure etc.).
Vladimír Čunát
@vcunat
Logs from the start of the period might shed more light on it. It certainly doesn't sound like something affected by the most recent changes we've made after the version you used.
Andreas Rammhold
@andir
Well I've just checked. The earliest is 19:00. But the graphs and some other systems on our side suggest that it started around 16:00. It seems like only the .com domain is affected.. I'll give that machine another 300GB of space for logs...
Ondřej Surý
@oerdnj
@andir Just to be sure, you had 1.2.0~20170110 installed, right?
Andreas Rammhold
@andir
yes
1.2.0~20170110-1+0~20170110151612.41+jessie~1.gbp7bff85
Ondřej Surý
@oerdnj
There were some changes in zone cut processing in that version as we found a similar bug in .org processing. Logs with failures would be extremely valuable.
Andreas Rammhold
@andir
I'll have to change our monitoring script so that it detects that issue and stops sending traffic to knot.. I'll then add it again and collect logs
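A monitoring check like the one described above could be sketched as follows. This is a minimal, hypothetical probe (the resolver address, probe domain, and the decision to pull a resolver out of rotation on SERVFAIL are all assumptions, not anything from this conversation); it builds a bare DNS query by hand so it needs no third-party library:

```python
import socket
import struct

def build_query(name, qtype=1, qid=0x1234):
    """Build a minimal DNS query packet (QTYPE defaults to A=1, class IN)."""
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)  # RD bit set
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack(">HH", qtype, 1)

def response_rcode(packet):
    """Extract the RCODE from a DNS response header (2 = SERVFAIL)."""
    flags = struct.unpack(">H", packet[2:4])[0]
    return flags & 0x000F

def probe(resolver_ip, name, timeout=2.0):
    """Query `name` against `resolver_ip`; return the RCODE, or None on timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(build_query(name), (resolver_ip, 53))
        data, _ = sock.recvfrom(4096)
        return response_rcode(data)
    except socket.timeout:
        return None
    finally:
        sock.close()

# Hypothetical usage: stop routing traffic to a resolver that SERVFAILs.
# if probe("127.0.0.1", "example.com") == 2:
#     ...remove this resolver from the pool...
```

The check could run from cron or the existing monitoring system; what "stop sending traffic" means depends entirely on the deployment.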
Ondřej Surý
@oerdnj
@andir do you have more samples of domains that were failing that you can share?
that's a brief excerpt of strange SERVFAILs that I've found
Ondřej Surý
@oerdnj
@andir Could you check for the first occurrence of dns1.fpt.vn. (or dns2.fpt.vn.) in the log? And if the SERVFAILs started to happen after that?
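A check like the one asked for here could be sketched as a small log scan. This assumes only that the log lines contain the nameserver name and the literal string "SERVFAIL"; the actual kresd log format may differ:

```python
def first_occurrence_then_servfails(lines, needle="dns1.fpt.vn."):
    """Find the index of the first line mentioning `needle`, and count
    SERVFAIL lines before and after that point."""
    first = None
    before = after = 0
    for i, line in enumerate(lines):
        if first is None and needle in line:
            first = i
        if "SERVFAIL" in line:
            if first is None:
                before += 1
            else:
                after += 1
    return first, before, after

# Hypothetical usage against a saved log file:
# with open("kresd.log") as f:
#     print(first_occurrence_then_servfails(f))
```

If `before` is zero and `after` is large, the failures cluster after that nameserver first appears, which is what the question is probing for.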
You are right, the log is confusing
@andir And you just issued cache.clear() or did you restart the server to make it work again?
Andreas Rammhold
@andir
first I tried restarting but that didn't help. I then cleared the cache
i've no earlier logs for that entry.. forgot to copy them.. typing on the phone is crap :/
Ondřej Surý
@oerdnj
@vcunat It looks like rplan is broken
But I have no idea why that might happen
Vladimír Čunát
@vcunat
Why rplan?
(I mean, any particular reason?)
Ondřej Surý
@oerdnj
@vcunat: the subqueries are always the same for failing queries
Vladimír Čunát
@vcunat
I don't think the subqueries belong to them. The [plan] lines in-between are always immediately finished without any log lines.
Andreas Rammhold
@andir
I guess one could add the query id (or such, or some pointer addr) to each line for a specific query?
That would ease the grep'ing
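If each log line carried such a tag, grouping a query's trace together becomes a one-pass scan. The `[qid=...]` tag format below is purely hypothetical (kresd's eventual format may differ); this is just a sketch of the grep'ing it would enable:

```python
import re
from collections import defaultdict

QID_RE = re.compile(r"\[qid=(\d+)\]")  # hypothetical per-line query-id tag

def group_by_query(lines):
    """Group log lines by their query-id tag so each query's trace reads together."""
    groups = defaultdict(list)
    for line in lines:
        m = QID_RE.search(line)
        if m:
            groups[m.group(1)].append(line)
    return dict(groups)
```

With many queries resolved concurrently, this turns an interleaved log into per-query traces.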
Ondřej Surý
@oerdnj
might be, I already spoke to gdemidov to add query-id to the logs
Vladimír Čunát
@vcunat
Yes, in practice there are too many queries solved at once.
Ondřej Surý
@oerdnj
@andir I have just triggered the build with today's git master. It resolves all the resolution failures we know of. Did the global failure with cache.clear() as a remedy occur again, or was it just a one-time thing?
Andreas Rammhold
@andir
well i've not fired it up yet since.. been sidetracked with other issues and also not very motivated to write an extended monitoring script for kresd yet :-) I'll try to do that tomorrow
matrixbot
@matrixbot
ondrej Ok, try with today's build. We are very close to releasing it as 1.2.0, so any testing would be appreciated. Also the log format has been extended to track the query/subquery tree.
Andreas Rammhold
@andir
okay, will be first thing in the morning
Andreas Rammhold
@andir
a day late but i've started the resolver again
Ondřej Surý
@oerdnj
@andir crossed fingers :); please let us know if you encounter anything
Andreas Rammhold
@andir
I will :-)
Ondřej Surý
@oerdnj
There's too much brokenness in the DNS world. People don't follow standards, but they expect the resolvers to cope with the shit they throw at them
Andreas Rammhold
@andir
just received a SERVFAIL for one of my domains when querying an AAAA.. 2nd attempt did succeed :/ I'm checking the logs..