These are chat archives for CZ-NIC/knot-resolver

20th
Dec 2016
Andreas Rammhold
@andir
Dec 20 2016 11:50
Hi, this room looks rather dead.. just wanted to comment on the above systemd stuff but realized that's >1 month old.. This chat isn't active anymore?
Vladimír Čunát
@vcunat
Dec 20 2016 11:51
People do listen here.
Andreas Rammhold
@andir
Dec 20 2016 11:52
Ah great :-) Any idea if you just switched servers for deb.knot-dns.cz ? I retrieve two different answers from *.ns.nic.cz.. one has an invalid SSL cert
a,b,d respond with 195.22.28.210 (which has the invalid cert) and c responds with howl.labs.nic.cz / 217.31.192.150
Vladimír Čunát
@vcunat
Dec 20 2016 11:55
I'm getting 217.31.192.150 from all of them.
But you might be redirected to different instances, I think.
That IP seems suspicious to me.
kdig -x 195.22.28.210 +short
beta.on-sys.net.
Andreas Rammhold
@andir
Dec 20 2016 11:58
yeah that's the site with invalid cert
well that occurs on a host running the latest knot-resolver so.... ;-)
Vladimír Čunát
@vcunat
Dec 20 2016 11:58
I don't think it's our IP at all.
Andreas Rammhold
@andir
Dec 20 2016 11:59
on my local machine I just tried to reproduce (same ASN, same route paths, ..) no success
Vladimír Čunát
@vcunat
Dec 20 2016 12:00
And kresd logs suggest that this IP was returned from *.ns.nic.cz, right?
Andreas Rammhold
@andir
Dec 20 2016 12:00
I did dig them manually and yes that is what I did read there
not sure what went wrong there:
;; AUTHORITY SECTION:
cz.            172800    IN    NS    ns1.csof.net.
cz.            172800    IN    NS    ns4.csof.net.
let me wipe the cache and restart the daemon.. not sure how it got that confused..
Vladimír Čunát
@vcunat
Dec 20 2016 12:04
Ugh, that looks bad.
Andreas Rammhold
@andir
Dec 20 2016 12:05
It could be a few reasons: a) this morning it ran out of shm since /var/run/knot-resolver/cache did hold the cache. b) the updated package this morning did break it or isn't able to read the "old" cache
Do you run with validation turned on?
Andreas Rammhold
@andir
Dec 20 2016 12:08
just default settings with webif, a bit of cache and prediction
Vladimír Čunát
@vcunat
Dec 20 2016 12:10
So you don't specify the location for the file with root trust anchors?
Andreas Rammhold
@andir
Dec 20 2016 12:10
it is specified in the start arguments
this is how it is running atm: /usr/sbin/kresd --config=/etc/knot-resolver/kresd.conf --verbose --forks=1 --keyfile=/usr/share/dns/root.key /var/cache/knot-resolver/cache
Andreas Rammhold
@andir
Dec 20 2016 12:47
that behaviour has gone away after clearing the cache.. still odd.. now towards fixing the monitoring script.. http endpoint only responds 503 :-)
matrixbot
@matrixbot
Dec 20 2016 13:03
ondrej Andreas, what version are you running?
ondrej I guess we'll need to clear the cache between major upgrades automatically
Vladimír Čunát
@vcunat
Dec 20 2016 13:04
We have a notion of cache version, and cache gets auto-cleared whenever there's a mismatch.
... but I believe we've recently done no changes that would require that. Even switching validation on/off shouldn't be a problem for this.
(The items have flags if they were validated.)
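To illustrate the idea described above (a stored cache version that triggers a wipe on mismatch, plus a per-item "validated" flag), here is a minimal Python sketch. It is not kresd's actual implementation: the `VersionedCache` class, the `__version__` key, and the version numbers are all hypothetical names chosen for the example.

```python
# Hypothetical illustration only; names and layout are NOT taken from kresd.
CACHE_VERSION = 2  # assumed format version understood by the running binary


class VersionedCache:
    """Wraps a dict-like store and wipes it when the stored version differs."""

    def __init__(self, storage, version=CACHE_VERSION):
        self._storage = storage
        if storage.get("__version__") != version:
            # Version mismatch (or no version at all): drop everything
            # rather than trying to interpret records in an unknown format.
            storage.clear()
            storage["__version__"] = version

    def put(self, key, value, validated):
        # Each item carries a flag saying whether it passed validation.
        self._storage[key] = (value, validated)

    def get(self, key):
        return self._storage.get(key)


# Usage: a store written by an "old" version is emptied on open.
old_store = {"__version__": 1, "cz. NS": ("a.ns.nic.cz.", False)}
cache = VersionedCache(old_store)
cache.put("cz. NS", "a.ns.nic.cz.", validated=True)
```

Because the validated flag travels with each item, switching validation on or off would not by itself require a wipe in a design like this.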
Andreas Rammhold
@andir
Dec 20 2016 13:08
ondrej the version from this morning
Version: 1.2.0~20161220+vld-refactoring-1+0~20161220095900.36+jessie~1.gbp27b69c
from the refactoring branch.. still not sure if that was intended to be released
Vladimír Čunát
@vcunat
Dec 20 2016 13:11
It's certainly not a "release".
Andreas Rammhold
@andir
Dec 20 2016 13:12
well yeah.. not in that kind of sense
matrixbot
@matrixbot
Dec 20 2016 13:12
ondrej I am confident enough to give it more widespread testing from the repository
ondrej I am running it successfully for some time and we need more people to give feedback at this moment.
Andreas Rammhold
@andir
Dec 20 2016 13:13
I'll certainly try to provide that :-) I'll redirect a few % of our production queries to it now that it seems to work fine again
well that being said.. the invalid entry is back -.-
let me search the log
matrixbot
@matrixbot
Dec 20 2016 13:23
ondrej The logs would be great. That's exactly the kind of feedback we are looking for. Although I am sorry for your inconvenience.
that's what I could find with a brief pass and searching for the deb mirror
matrixbot
@matrixbot
Dec 20 2016 13:24
ondrej Could you please share the config as well? Or is it the default?
Andreas Rammhold
@andir
Dec 20 2016 13:25
i added the config as 2nd file in the gist
matrixbot
@matrixbot
Dec 20 2016 13:30
ondrej The NXDOMAIN at the beginning is weird. Is something intercepting your IPv6 traffic?
ondrej It should be NOERROR
Andreas Rammhold
@andir
Dec 20 2016 13:31
there is nothing in our network intercepting the traffic.. it is dns server -> cisco edge router with decix peering -> d.dns.nic.cz
matrixbot
@matrixbot
Dec 20 2016 13:46
ondrej Something has obviously happened between 13:48:33 and 13:58:59
ondrej And this "NS is provably without DS, going insecure" should definitely not happen.
Andreas Rammhold
@andir
Dec 20 2016 13:48
I see it on other domains as well now... this is very weird..
Vladimír Čunát
@vcunat
Dec 20 2016 13:49
Certainly, but I can't see why any of these happens.
Ondřej Surý
@oerdnj
Dec 20 2016 13:49
@andir Yep, it's running a catchall DNS server, could you find the first occurrence of 54.77.72.254 or 54.72.8.183 in the logs? And a few lines before?
Vladimír Čunát
@vcunat
Dec 20 2016 13:49
Already the first lines are bad... NXDOMAIN for knot-dns.cz MUST NOT be accepted unless validated.
But the validator produced no output in log on that.
Ondřej Surý
@oerdnj
Dec 20 2016 13:51
@vcunat One more thing to improve in the validator refactoring
Vladimír Čunát
@vcunat
Dec 20 2016 13:51
To be clear, I don't think it's the log line that's missing; rather the validator didn't fire for some reason.
Andreas Rammhold
@andir
Dec 20 2016 13:53
compressed logs are a pain.. still grepping through them...
Ondřej Surý
@oerdnj
Dec 20 2016 13:56
@andir The first occurrence after the last cache clear should be enough (jftr)
Andreas Rammhold
@andir
Dec 20 2016 13:57
for what it is worth I found the queries to the same address on the old version already...
Vladimír Čunát
@vcunat
Dec 20 2016 13:59
I think the problem would be already with plan 'cz' type 'NS' somewhere early, like in that unbound thread.
Ondřej Surý
@oerdnj
Dec 20 2016 14:00
If we had a query that triggered the delegation cache overwrite, it would be much easier to solve
Andreas Rammhold
@andir
Dec 20 2016 14:02
I'd love to know that one... I've 1Mbit/s 'test' traffic.. which produced 7GB of logs in ~20min.. after the last flush with '*' it just started querying there again
from the invalid resolver
Ondřej Surý
@oerdnj
Dec 20 2016 14:03
@andir And you can't share the 'test' traffic, right?
Andreas Rammhold
@andir
Dec 20 2016 14:03
I would rather not do that since it is mirrored from production
Ondřej Surý
@oerdnj
Dec 20 2016 14:03
@andir thought so
Andreas Rammhold
@andir
Dec 20 2016 14:04
I could whitelist some of you if you want to run queries against it..
Ondřej Surý
@oerdnj
Dec 20 2016 14:09

@andir could you:

  1. add cache.clear() right after cache.size=1000*M in the config
  2. stop the kresd
  3. clear the logs (or just rotate/checkpoint those)
  4. start kresd
  5. start querying deb.knot-dns.cz from external script
  6. run the 'test' traffic
  7. stop kresd and cut the logs right after it starts giving the wrong answers?

This would minimize the logs produced to analyze
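The "external script" from step 5 could be sketched roughly as below. This is an assumption-laden example, not something shared in the chat: the `watch` and `resolve_with_kdig` helpers are hypothetical, the resolver is assumed to listen on 127.0.0.1, and the expected address is the 217.31.192.150 reported earlier in this log.

```python
import subprocess
import time

# Address reported as correct earlier in the conversation.
EXPECTED = {"217.31.192.150"}


def resolve_with_kdig(name, server="127.0.0.1"):
    """Ask the resolver under test via kdig; the server address is an assumption."""
    out = subprocess.run(
        ["kdig", "@" + server, name, "A", "+short"],
        capture_output=True, text=True, check=False,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def watch(resolve, name, expected, interval=1.0, max_polls=None, sleep=time.sleep):
    """Poll `name` until an answer outside `expected` shows up.

    Returns (poll_number, unexpected_answers), or None if max_polls
    is reached without seeing anything unexpected.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        answers = set(resolve(name))
        polls += 1
        unexpected = answers - set(expected)
        if unexpected:
            return polls, unexpected
        sleep(interval)
    return None


if __name__ == "__main__":
    hit = watch(resolve_with_kdig, "deb.knot-dns.cz", EXPECTED)
    print(time.strftime("%H:%M:%S"), "first wrong answer:", hit)
```

The timestamp printed at the first wrong answer gives the point at which to stop kresd and cut the logs (step 7).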

Andreas Rammhold
@andir
Dec 20 2016 14:09
yeah probably.. on it
Ondřej Surý
@oerdnj
Dec 20 2016 14:09
@andir I don't really think that the query side is that interesting (at this point of time)
Andreas Rammhold
@andir
Dec 20 2016 14:12
server is up, returns valid result.. i'll add some traffic
Vladimír Čunát
@vcunat
Dec 20 2016 14:13
I have a test case reproduced locally.
Andreas Rammhold
@andir
Dec 20 2016 14:14
oh, how does it happen then?
Vladimír Čunát
@vcunat
Dec 20 2016 14:14
It's for org. NS.
Just start it up with clear cache, ask for api-nyc01.exip.org and then ask for NS org.
Ondřej Surý
@oerdnj
Dec 20 2016 14:15
vcunat: so it's similar to what i saw with a.root-server.org (without s) yesterday?
Vladimír Čunát
@vcunat
Dec 20 2016 14:15
It will return those *.csof.net values.
Ondřej Surý
@oerdnj
Dec 20 2016 14:15
I thought I saw the *.csof.net recently somewhere
Andreas Rammhold
@andir
Dec 20 2016 14:16
yeah works here too.. :/
Vladimír Čunát
@vcunat
Dec 20 2016 14:16
I found it mentioned as malware-distributing stuff.
Ondřej Surý
@oerdnj
Dec 20 2016 14:18
nah, the root-server.net is something else
@vcunat: Are you able to reproduce it with vld-refactoring branch?
Vladimír Čunát
@vcunat
Dec 20 2016 14:21
Yes, there I reproduce it as above, but not on master (fortunately).
Ondřej Surý
@oerdnj
Dec 20 2016 14:25
vcunat: I am not able to reproduce it on vld-refactoring branch (compiled from git)
vcunat: I think exip.org was already taken down in the meantime, while we were debugging
rusticus
@rusticus
Dec 20 2016 14:28
I am able to reproduce it on vld-refactoring too.
Ondřej Surý
@oerdnj
Dec 20 2016 14:28
@vcunat @rusticus If you are able to see it, please dump the traffic with tcpdump
Vladimír Čunát
@vcunat
Dec 20 2016 14:28
I tried now again, without success.
Ondřej Surý
@oerdnj
Dec 20 2016 14:29
ah, I've changed the RRTYPE from A to AAAA and now my cache has been poisoned as well
Ondřej Surý
@oerdnj
Dec 20 2016 14:40
@rusticus @vcunat: Right:
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 64443
;; Flags: qr aa rd; QUERY: 1; ANSWER: 1; AUTHORITY: 2; ADDITIONAL: 2

;; QUESTION SECTION:
;; APi-NYC01.ExiP.Org.      IN A

;; ANSWER SECTION:
APi-NYC01.ExiP.Org.     100 IN A 127.0.0.1

;; AUTHORITY SECTION:
Org.                    172800 IN NS ns1.csof.net.
Org.                    172800 IN NS ns4.csof.net.

;; ADDITIONAL SECTION:
ns1.csof.net.           100 IN A 54.77.72.254
ns4.csof.net.           100 IN A 54.72.8.183

;; Received 128 B
;; Time 2016-12-20 15:40:06 CET
;; From 54.72.8.183@53(UDP) in 47.2 ms
It's injecting glue for org.
@andir I'll revert the debian packages to the master branch for this moment before we fix this
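The response above carries NS records for `Org.` even though the answering server was only asked about a name under `exip.org`. A common defence is a bailiwick check: keep only records whose owner name sits at or below the zone the queried server is supposed to be authoritative for. The following is a minimal Python sketch of that check, not kresd's code; `in_bailiwick` and `filter_authority` are hypothetical helpers, and treating `exip.org.` as the zone cut is an assumption made for the example.

```python
def labels(name):
    """Split a DNS name into lowercase labels; the root name gives []."""
    name = name.rstrip(".").lower()
    return name.split(".") if name else []


def in_bailiwick(owner, zone):
    """True if `owner` equals `zone` or is a subdomain of it (case-insensitive)."""
    o, z = labels(owner), labels(zone)
    return len(o) >= len(z) and o[len(o) - len(z):] == z


def filter_authority(records, zone):
    """Drop records whose owner lies outside `zone`.

    `records` is a list of (owner, rtype, rdata) tuples.
    """
    return [r for r in records if in_bailiwick(r[0], zone)]


# Records taken from the response above; the zone cut is assumed.
authority = [
    ("Org.", "NS", "ns1.csof.net."),
    ("Org.", "NS", "ns4.csof.net."),
]
print(filter_authority(authority, "exip.org."))  # both records are rejected
```

With a check like this, the `Org.` NS set and the accompanying `csof.net` glue would never reach the delegation cache, because `org.` is a parent of the queried zone, not part of it.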
Vladimír Čunát
@vcunat
Dec 20 2016 14:41
The localhost A record may also have some purpose, but I can't see what.
Andreas Rammhold
@andir
Dec 20 2016 14:41
oerdnj: thanks :-)
Ondřej Surý
@oerdnj
Dec 20 2016 14:42
@vcunat: something like rebinding attacks (not sure though)
Ondřej Surý
@oerdnj
Dec 20 2016 14:53
@andir knot-resolver_1.2.0~20161220+vld-refactoring-off-1+0~20161220144350.37+jessie~1.gbpee8466 is available from the repository
Andreas Rammhold
@andir
Dec 20 2016 14:53
thanks, i'll give it a run
looks fine so far
Andreas Rammhold
@andir
Dec 20 2016 18:03
Is python2 available on macOS when installed via homebrew or by default? I'm trying to set up a local test env for the git branches but my machines are already defaulting to python3 causing a bit of pain :-)
Vladimír Čunát
@vcunat
Dec 20 2016 18:12
@andir: wrong channel?
Andreas Rammhold
@andir
Dec 20 2016 18:13
well not really.. it's about deckard which you are using
Vladimír Čunát
@vcunat
Dec 20 2016 18:14
I don't know. I've never used a mac. Still, it seems typical that python2 and python3 can be installed in parallel.
Andreas Rammhold
@andir
Dec 20 2016 18:15
yeah.. It just seems that the tests are also designed to work with mac since it is mentioned so often :-)
i've a patch for deckard which fixes it..
Vladimír Čunát
@vcunat
Dec 20 2016 18:17
They are run on Travis macs regularly.
Andreas Rammhold
@andir
Dec 20 2016 18:18
ahh I see.. Well anyway I've opened a PR for this rather simple fix: https://github.com/CZ-NIC/deckard/pull/3/files
Vladimír Čunát
@vcunat
Dec 20 2016 18:18
Thanks for the patch.
Andreas Rammhold
@andir
Dec 20 2016 18:19
libfaketime is next.. doesn't seem like anyone with a recent gcc (>=6) has used it before.. :/
Vladimír Čunát
@vcunat
Dec 20 2016 18:19
Yes, I bumped into that just today for the first time.
Andreas Rammhold
@andir
Dec 20 2016 18:21
I just ifdef'ed those NULL checks out currently but that doesn't feel right.. It never feels right to remove sanity checks..
Vladimír Čunát
@vcunat
Dec 20 2016 18:22
-Wno-error=... probably
Andreas Rammhold
@andir
Dec 20 2016 18:24
I've a feeling we (as in all developers on the planet) just start collecting Warning exceptions more and more and more... but that's probably okay here