These are chat archives for CZ-NIC/knot-resolver

13th
Aug 2018
Robert Šefr
@robcza
Aug 13 2018 09:32
Is there any best practice, checklist, or script for what to do when a resolver behaves in a strange way? E.g. we see a huge latency increase - what should I check? Having such a troubleshooting list and building it up in the docs would be great, and I would like to contribute as well.
Vladimír Čunát
@vcunat
Aug 13 2018 09:34
I would look at the verbose log, though I'm not sure how readable that is for others.
Robert Šefr
@robcza
Aug 13 2018 09:36
@vcunat can I dump the whole active configuration?
Vladimír Čunát
@vcunat
Aug 13 2018 09:38
@robcza: I don't get what you mean. You can activate verbose logging by passing --verbose when starting the process, or by verbose(true) in the configuration (including the CLI of a running process).
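For example (a sketch - the control socket path under the rundir is just an illustration, adjust to your setup):

# enable verbose logging when starting the daemon
kresd --verbose -c /etc/kres/kres.conf
# or toggle it at runtime over the control socket
socat - UNIX-CONNECT:/var/lib/kres/tty/108
> verbose(true)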
Robert Šefr
@robcza
Aug 13 2018 10:04

@vcunat verbose logging is quite clear, I use it often.
I have a particular issue I'm not able to deal with:

PID   USER     TIME  COMMAND
  108 root     14:33 /usr/local/sbin/kresd -f 1 -c /etc/kres/kres.conf /tty/
  229 root      0:42 /usr/local/sbin/kresd -f 1 -c /etc/kres/kres.conf /tty/
root@machine:~# socat - UNIX-CONNECT:/var/lib/kres/tty/108

I'm not getting the CLI here, it just hangs. For process 229 it works OK and I can work with the CLI, but not for 108.
At the same time, this machine is reported as one with resolution issues. Both processes are bound to port 53, with traffic balanced through SO_REUSEPORT.

Any ideas how to identify the issue?

Vladimír Čunát
@vcunat
Aug 13 2018 10:32

Alive process but hanging? I don't remember seeing such a state since a problem with an incorrect libpthread lock implementation, but that should have been patched in recent glibc versions.
I expect 108 won't process any queries anymore - is it consuming CPU? (Typical states might be none or 100%.)
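To see per-process CPU rather than the machine-wide load, something like this (a sketch; 108 is the PID from your listing, and busybox ps/top may not support these options):

# CPU, state, and accumulated time of the suspect worker
ps -o pid,%cpu,stat,time -p 108
# or watch it live
top -p 108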

Robert Šefr
@robcza
Aug 13 2018 10:49
@vcunat no, nothing like this: load average: 0,03, 0,08, 0,04
Vladimír Čunát
@vcunat
Aug 13 2018 10:54
@robcza: I'd try to attach gdb to it and print the stack trace - even without debug symbols, if you don't have them.
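Something like this dumps all thread stacks non-interactively (a sketch, assuming gdb is installed in the container):

# attach to PID 108 and print a backtrace of every thread
gdb -p 108 -batch -ex 'set pagination off' -ex 'thread apply all bt'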
Robert Šefr
@robcza
Aug 13 2018 11:24
This is what I was able to get. The process died eventually; I'm not really sure whether that was a result of my actions or something else:
(gdb) bt full
#0  0x00007fb94f2ded09 in ?? () from /lib/ld-musl-x86_64.so.1
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.
Vladimír Čunát
@vcunat
Aug 13 2018 11:25
Wait, you're running this on musl? :-) Hopefully it's not some musl-specific bug.
(e.g. like glibc had in the pthreads locks)
Petr Špaček
@pspacek
Aug 13 2018 11:26
BTW generic instructions for debugging hung processes apply, see e.g. http://www.port389.org/docs/389ds/FAQ/faq.html#debug_hangs
Robert Šefr
@robcza
Aug 13 2018 11:28
@vcunat it is a docker build based on https://hub.docker.com/r/cznic/knot-resolver
@pspacek this is great, thank you
Vladimír Čunát
@vcunat
Aug 13 2018 11:28
Oh, I forgot - Alpine. Among larger deployments I only know of Omnia (a couple thousand instances) using ld-musl-armhf.
Robert Šefr
@robcza
Aug 13 2018 11:31
@vcunat according to the stack trace, can we assume it hung on musl?
Vladimír Čunát
@vcunat
Aug 13 2018 11:32
Probably. The top frame is inside the library. It might be in a syscall, FWIW. (I don't have much experience debugging such hangs.)
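A quick way to check for that without gdb (a sketch; both need root):

# shows the syscall the process is blocked in, if any
cat /proc/108/syscall
# or trace what it is (not) doing
strace -p 108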
Petr Špaček
@pspacek
Aug 13 2018 11:33
I think this stack is useless; it is quite possible that GDB gave up and did not unwind the stack.
@robcza Beware that this Docker image is just a tutorial toy, not really meant for anything other than a showcase. It is completely untested!
Vladimír Čunát
@vcunat
Aug 13 2018 11:34
I don't think that would print libc in #0, but I'm not sure.
Robert Šefr
@robcza
Aug 13 2018 11:38
@pspacek understood, going to switch the whole thing to fedora
Vladimír Čunát
@vcunat
Aug 13 2018 11:40
We maintain the Fedora packages even downstream, so they should be relatively fresh all the time.