Héctor Molinero Fernández
@hectorm

Hi, I'm building Knot Resolver 5.1.2 on a Raspberry Pi 4 with Ubuntu 20.04 (arm64) and when I run kresd I get the following error:

PANIC: unprotected error in call to Lua API (bad light userdata pointer)

This is a known issue with LuaJIT on arm64: https://gitlab.nic.cz/knot/knot-resolver/issues/216

What puzzles me is that this problem does not occur with the official Knot Resolver package, where kresd works fine, so I was wondering why my build behaves differently.

Vladimír Čunát
@vcunat
I believe we've patched up kresd to not suffer from this, so it normally comes from luajit packages that haven't been fixed yet.
lua-cqueues is the typical one in our case, as it's loaded by default if found.
Héctor Molinero Fernández
@hectorm

Does cqueues suffer from this problem in its version 20200726.51-0?

I tried to install all Lua packages from LuaRocks instead of Ubuntu repositories, but I encountered the same error.

I'm not familiar with the Knot Resolver code or with LuaJIT, but I've seen this line in the code and I was wondering if it could be related.
Vladimír Čunát
@vcunat

It is related. We did

    /** Static to work around lua_pushlightuserdata() limitations.
     * TODO: convert to a proper singleton like worker, most likely. */
    static struct engine engine;

just because of that line and that error.

But that was many months ago.
The best test for cqueues is on that ticket:
luajit -l cqueues -e 'os.exit(0)'
So far I'm not sure why it's causing problems on *.deb systems.
I don't really have access to an aarch64 where I can test deb packages well.
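For context on the `bad light userdata pointer` error: as far as I know, 64-bit LuaJIT builds accept a light userdata pointer only if its address fits in the low 47 bits, while arm64 systems can hand out higher addresses. The sketch below only illustrates that check in Python. The helper name and both sample addresses are made up for illustration and are not taken from LuaJIT or kresd.

```python
# Assumption: LuaJIT in 64-bit (GC64) mode rejects light userdata pointers
# that have any bit set at or above bit 47.
LIGHTUD_BITS = 47


def fits_lightuserdata(addr: int) -> bool:
    """Return True if the address fits in the low 47 bits."""
    return addr >> LIGHTUD_BITS == 0


# Hypothetical addresses, for illustration only.
low_addr = 0x0000_5555_5555_0000   # below 2**47, would be accepted
high_addr = 0x0000_FFFF_9000_0000  # at or above 2**47, would be rejected

print(fits_lightuserdata(low_addr))   # True
print(fits_lightuserdata(high_addr))  # False
```

Under this assumption, whether a given build crashes depends on where the pushed object ends up in memory, which can differ between distributions and build settings.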
Héctor Molinero Fernández
@hectorm
The luajit -l cqueues -e 'os.exit(0)' command is working properly with moonjit 2.1.2 and cqueues 20200726.51-0. Could you tell me how to know which other LuaJIT package might be causing this crash?
Héctor Molinero Fernández
@hectorm
Never mind, even with no LuaJIT packages installed on my system I get the same error when I run kresd. I'll need to run more tests to find out what's going on.
Vladimír Čunát
@vcunat
Hmm, that is interesting.
Héctor Molinero Fernández
@hectorm

To replicate the problem correctly I've written down the build steps in Dockerfiles and the conclusion is that the crash occurs in Ubuntu 20.04, Debian Buster, Debian Sid and Alpine 3. It works correctly in Fedora 32, CentOS 7 and openSUSE Tumbleweed.

I get the same results with the latest commit of the v2.1 branch of LuaJIT (570e758) and with Moonjit 2.1.2.

I'm not using any Lua package and the Docker images have been built directly on a Raspberry Pi 4 (8 GB) with a 64-bit kernel.

Héctor Molinero Fernández
@hectorm
@vcunat If you think this might be helpful I can post my results on the issue you linked above.
Vladimír Čunát
@vcunat
Yes. That does sound useful. Thanks!
Héctor Molinero Fernández
@hectorm
Furthermore, if you don't have the required hardware I can provide SSH access to my Raspberry Pi to help you track down this issue.
Vladimír Čunát
@vcunat
I'll see, maybe we can manage to test this reasonably easily. I'll send a private message here otherwise.
mrvne
@mrvne
Hello guys, any ideas what this error message could mean? I see it quite often.
kresd[32410]: ERROR: udp sendmmsg() sent -1 / 1; Operation not permitted
Vladimír Čunát
@vcunat
It's about sending a UDP answer.
I think I've seen EPERM on systems that explicitly disable IPv6 (and you try to use it).
mrvne
@mrvne
I am using IPv6 and it's doing its job fine. So should I be worried about this error?
Vladimír Čunát
@vcunat
The line shows that one answer gets lost, but I don't understand how it can happen.
Petr Špaček
@pspacek
This also happens if connection tracking in the firewall is not performant enough. Is the system under load? How many queries per second do you see?
mrvne
@mrvne
Unfortunately logging is disabled. And as far as I know there is no option to see it on the fly? There are many clients connected and the resolver should be under medium to heavy load.
Vladimír Čunát
@vcunat
I know e.g. about atop showing packet throughput per interface. I'm not sure if something can show it per process, but it may not matter if not much else is producing them.
mrvne
@mrvne
It's a VPN server, so tons of traffic is going over the interface.
Vladimír Čunát
@vcunat
worker.stats().queries over control socket will show the number of queries processed so far.
Petr Špaček
@pspacek
Heavy load supports the hypothesis that the EPERM socket error comes from a firewall which is not keeping up. We have seen this in our tests, so generally we unload all iptables/nftables/conntrack modules from the kernel :-)
mrvne
@mrvne
I have 2 sockets running; the first one reports 13265967 and the second 13296084. Nftables is present and allows only internal IPs to access the resolver.
Petr Špaček
@pspacek
You can query the stats again after some period of time (let's say one minute), subtract the two values and divide by the elapsed seconds to obtain queries per second.
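The subtract-and-divide step can be sketched in Python as follows. The first counter is the value reported above for the first instance; the second reading and the 300-second interval are hypothetical numbers chosen for illustration, and the function name is my own.

```python
def queries_per_second(count_start: int, count_end: int, seconds: float) -> float:
    """Average query rate between two readings of worker.stats().queries."""
    if seconds <= 0:
        raise ValueError("the interval must be positive")
    return (count_end - count_start) / seconds


start = 13265967      # reading reported in the chat
end = start + 5700    # hypothetical reading taken 300 seconds later
print(queries_per_second(start, end, 300))  # 19.0
```

With several kresd instances, each one keeps its own counter, so the per-instance rates would need to be added up.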
mrvne
@mrvne
Seems to be ~19 queries/s (measured over a 5-minute range),
but it can increase rapidly at any time, because the user count fluctuates.
Petr Špaček
@pspacek
Heh, 19/s is nothing. 190 000 would be a number where we would need to think more, but 19/s is peanuts.
mrvne
@mrvne
:))) fine, still worried about the error. Actually everything seems to work fine, but in any case I will keep monitoring this.
Petr Špaček
@pspacek
How frequently do you see the error?
mrvne
@mrvne
On average ~5 entries every 30-45 minutes.
Sometimes there are no errors at all in a 2-hour range.
Petr Špaček
@pspacek
Interesting. Can you see a pattern? Something like smaller clusters in time?
mrvne
@mrvne
Those errors almost always come in small batches of 4-6 entries. Sometimes they also differ, like here:
ERROR: udp sendmmsg() sent -1 / 2; Operation not permitted
ERROR: udp sendmmsg() sent -1 / 1; Operation not permitted
ERROR: udp sendmmsg() sent -1 / 3; Operation not permitted
I'm not sure what the / 2, / 1, / 3 mean, but I see a mix of them.
Vladimír Čunát
@vcunat
The number after slash is the number of answers attempted to be sent in the (failed) batch.
(giving an upper bound on the number of lost answers)
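As a rough illustration of that upper bound, the sketch below sums the number after the slash over all failed batches in a set of log lines. The regular expression is inferred only from the three lines quoted above, so treat the exact log format as an assumption; the function name is hypothetical.

```python
import re

# Pattern inferred from the quoted log lines: "sent <result> / <attempted>".
SENDMMSG_RE = re.compile(r"udp sendmmsg\(\) sent (-?\d+) / (\d+)")


def max_lost_answers(log_lines):
    """Upper bound on lost answers: sum of batch sizes for failed batches."""
    total = 0
    for line in log_lines:
        match = SENDMMSG_RE.search(line)
        if match is None:
            continue  # not a sendmmsg error line
        result, attempted = int(match.group(1)), int(match.group(2))
        if result < 0:  # the whole batch failed
            total += attempted
    return total


lines = [
    "ERROR: udp sendmmsg() sent -1 / 2; Operation not permitted",
    "ERROR: udp sendmmsg() sent -1 / 1; Operation not permitted",
    "ERROR: udp sendmmsg() sent -1 / 3; Operation not permitted",
]
print(max_lost_answers(lines))  # 6
```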
Petr Špaček
@pspacek
Very interesting. Do you have any firewall rules in the OUTPUT chain?
mrvne
@mrvne
Nope. There are no rules in the OUTPUT chain.
I've got just two rules in the INPUT chain to allow internal subnets to access the resolver. That's pretty much it.
Petr Špaček
@pspacek
Even more interesting. I have no idea what is going on. This error likely comes from the kernel, so we would have to dig into the kernel sources :-(
pguizeline
@pguizeline
Hi! I'm trying to use the resolver with RPZ and I'm getting the following error:
[poli] RPZ /etc/knot-resolver/blocklist.rpz:95939: invalid number
Maybe it's a parsing error in the file I'm generating? I've looked at the affected line and it looks okay, like the rest of the file. I can't find anything in the docs about this error. Thanks in advance for any help!
Vladimír Čunát
@vcunat
I can't remember seeing such an error. Can you post the problematic DNS record? (a single line probably)
pguizeline
@pguizeline
Sure!