Andreas Rammhold
@andir
oerdnj: it died again
but no coredump yet :/
ahh well... it did log 50GB of stuff... I'll have to remove the verbose flag or get larger disks for those tests... I'll run it again with empty caches
Ondřej Surý
@oerdnj
So it ate all the disk space? And it's crashing because LMDB fails to write to the cache?
Andreas Rammhold
@andir
well it worked before... I'm trying to reproduce now :-)
Vladimír Čunát
@vcunat
Out-of-space on cache write should be handled on our side, but it's certainly an almost untested condition.
Actually, LMDB creates a holey (sparse) file and mmaps it into memory. I'm not sure if it's possible to catch the situation when it accesses a part that has no disk space assigned yet - and the disk is full ATM.
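For reference, the LMDB-backed cache is sized up front from the Lua config; the mapping is sparse, so disk blocks are only consumed as entries are written. A minimal sketch, assuming the cache.open()/cache.size API described in the kresd docs of that era (exact names, paths and values here are illustrative and may differ between versions):

```lua
-- Illustrative kresd cache configuration (not from the chat; values made up).
-- The LMDB cache file is created sparse at this size and mmapped, so a
-- filesystem that fills up later is what actually breaks the writes,
-- as discussed above.
cache.open(4 * GB, 'lmdb:///var/cache/knot-resolver')

-- Shorthand if the default storage path is fine:
-- cache.size = 4 * GB
```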
Andreas Rammhold
@andir
I did some tuning of the rsyslog settings... they are rather verbose on Debian by default... logging stuff >=2x, plus journald...
hopefully that false result won't happen again
Ondřej Surý
@oerdnj
@andir So, everything running smoothly? :)
Andreas Rammhold
@andir
@oerdnj so far so good, it auto-upgraded to the version from this morning
Andreas Rammhold
@andir
For a while the predict.queue metric also declined... now with a 12h period it only grows (even after >12h) :/ Not sure whether 12h is a good idea or practical at all...
Vladimír Čunát
@vcunat
I think the predict module isn't very advanced ATM. For high-traffic resolvers it might only be usable with a short window (and period).
Vladimír Čunát
@vcunat
Note that the predictions are done from a table of estimated most-frequent queries - and that table has ~5k lines.
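For context, the window and period knobs live in the predict module's table in the kresd Lua config. A minimal sketch following the shape of the documented example (window in minutes, period as the number of windows kept; the exact semantics are assumed from the docs and may differ between versions):

```lua
-- Sketch of a predict module configuration (values illustrative).
modules = {
    predict = {
        window = 15,            -- sampling window, in minutes
        period = 6 * (60 / 15), -- number of windows kept (~6 hours here)
    }
}
-- A 12h setup like the one discussed above would be roughly
-- window = 30, period = 12 * (60 / 30).
```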
Andreas Rammhold
@andir
I'll investigate prediction tuning when I'm back from congress/holiday travels.
[attached image: image.png]
that's my current graph of the predict queue :-) Looked fine a few versions ago... can't remember when I switched to the rather large window/period
Andreas Rammhold
@andir
I just ran into a SERVFAIL condition where some records (e.g. google.com) responded with SERVFAIL and others worked fine. A cache flush fixed it. The version is the latest from yesterday. I'll have time to check logs tomorrow if required :-)
Vladimír Čunát
@vcunat
An A record for google.com failed sometimes?
Andreas Rammhold
@andir
always
from some point in time on
matrixbot
@matrixbot
ondrej Do you have an idea how full your cache was at that time?
Andreas Rammhold
@andir
I can pull up the graphs later. flokli might be able to give us that information. I'm out with my phone only :-)
matrixbot
@matrixbot
ondrej I am watching Little Mole with my kids and I am on my phone ;)
Vladimír Čunát
@vcunat
We merged major changes to master today, and these fixed almost all of the bad responses we knew of. With google.com there's the "problem" that they serve different DNS in different locations...
matrixbot
@matrixbot
ondrej The Debian/Ubuntu packages were already on the vld-refactoring branch, but there were more fixes today. I'll package the updated version today or tomorrow morning....
Florian Klink
@flokli
vcunat, ondrej: these are the graphs: http://i.cubeupload.com/AR5OSj.png
cache.delete shows pretty well when the cache was flushed
the ~10-15 minutes before that look strange
(scroll down first)
Andreas Rammhold
@andir
thanks flokli :-)
Florian Klink
@flokli
np
Andreas Rammhold
@andir
are we missing a cache.size graph?
Florian Klink
@flokli
I think those are all the graphs I had... Let me check
Florian Klink
@flokli
andir: nope, no cache.size
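Since there's no cache.size graph, one workaround (a sketch only, assuming event.recurrent() and cache.count() behave as the kresd docs describe for this generation) would be to log cache occupancy periodically from the config and feed that into the existing graphing:

```lua
-- Sketch: periodically report cache occupancy from the kresd Lua config.
-- A collectd/Graphite exporter could consume this instead of stdout.
event.recurrent(60 * sec, function ()
    print(string.format('cache entries: %d', cache.count()))
end)
```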
Ondřej Surý
@oerdnj
I am building today's git version; there are some fixes to DNSSEC validation.
Ondřej Surý
@oerdnj
@andir It would be great if you could sift through the logs from around the time the failures started to happen.
Did it all happen in one TLD, or randomly across more?
Vladimír Čunát
@vcunat
Hmm, yes, 19:35--19:45 seems to be the range you speak of. It's as if it stopped or got paralyzed at the start, since even answer.total drops, which should be the number of replies sent to clients (regardless of success/failure etc.).
Vladimír Čunát
@vcunat
Logs from the start of the period might shed more light on it. It certainly doesn't sound like something affected by the most recent changes we've made after the version you used.
Andreas Rammhold
@andir
Well, I've just checked. The earliest is 19:00. But the graphs and some other systems on our side suggest that it started around 16:00. It seems like only the .com domain is affected... I'll give that machine another 300GB of space for logs...
Ondřej Surý
@oerdnj
@andir Just to be sure, you had 1.2.0~20170110 installed, right?
Andreas Rammhold
@andir
yes
1.2.0~20170110-1+0~20170110151612.41+jessie~1.gbp7bff85
Ondřej Surý
@oerdnj
There were some changes to zone cut processing in that version, as we found a similar bug in .org processing. Logs with failures would be extremely valuable.
Andreas Rammhold
@andir
I'll have to change our monitoring script so that it detects that issue and stops sending traffic to Knot... I'll then add it back and collect logs
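A minimal external check along those lines might look like the sketch below. It is purely hypothetical: it assumes kdig (from Knot DNS) is installed, that its +time/+retry options and dig-style "status:" header line are available, and that whatever balances the traffic can act on a non-zero exit code.

```lua
-- health-check.lua: hypothetical sketch, not the monitoring script from
-- the chat. Queries the local resolver for a canary name and exits
-- non-zero on anything other than NOERROR.
local function rcode(name, server)
    -- assumes kdig prints a dig-style header containing "status: <RCODE>"
    local cmd = string.format('kdig @%s %s A +time=2 +retry=0 2>/dev/null',
                              server, name)
    local out = assert(io.popen(cmd)):read('*a')
    return out:match('status:%s*(%u+)') or 'TIMEOUT'
end

local status = rcode('google.com', '127.0.0.1')
if status ~= 'NOERROR' then
    io.stderr:write('resolver check failed: ' .. status .. '\n')
    os.exit(1)  -- signal the monitoring system to stop routing traffic here
end
```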
Ondřej Surý
@oerdnj
@andir do you have more samples of domains that were failing that you can share?
Andreas Rammhold
@andir
that's a brief excerpt of strange SERVFAILs that I've found