These are chat archives for CZ-NIC/knot-resolver

15th
Aug 2018
Robert Šefr
@robcza
Aug 15 2018 08:16
Fedora image up and running. Should I contribute a Dockerfile for the Fedora build?

On Fedora with kresd 2.4.1 I'm dealing with a cache issue and I'm not sure what I'm doing wrong. I have two processes starting at the same time, both pointed at the same LMDB cache. They both fail at first:

2018-08-15 08:10:06,708 DEBG 'resolver_01' stderr output:
[cache] LMDB error: Resource temporarily unavailable

2018-08-15 08:10:06,708 DEBG 'resolver_01' stderr output:
[cache] LMDB error: Resource temporarily unavailable

2018-08-15 08:10:06,709 DEBG 'resolver_01' stdout output:
[     ][cach] incompatible cache database detected, purging

2018-08-15 08:10:06,713 DEBG 'resolver_00' stderr output:
[cache] LMDB error: Resource temporarily unavailable
[cache] LMDB error: Resource temporarily unavailable

2018-08-15 08:10:06,714 DEBG 'resolver_00' stdout output:
[     ][cach] incompatible cache database detected, purging

2018-08-15 08:10:06,731 DEBG 'resolver_00' stdout output:
[     ][cach] incompatible cache database detected, purging

2018-08-15 08:10:06,731 DEBG 'resolver_00' stderr output:
[cache] LMDB error: Resource temporarily unavailable
[cache] LMDB error: Resource temporarily unavailable
[cache] LMDB error: Resource temporarily unavailable
kresd: lib/cache/peek.c:144: peek_nosync: Assertion `false' failed.

2018-08-15 08:10:06,731 DEBG 'resolver_01' stdout output:
[     ][cach] incompatible cache database detected, purging

2018-08-15 08:10:06,731 DEBG 'resolver_01' stderr output:
[cache] LMDB error: Resource temporarily unavailable
[cache] LMDB error: Resource temporarily unavailable
[cache] LMDB error: Resource temporarily unavailable
kresd: lib/cache/peek.c:144: peek_nosync: Assertion `false' failed.

The configuration is quite simple (same for both processes):

cache.storage = 'lmdb:///var/lib/kres/cache'
cache.size = os.getenv('KNOT_CACHE_SIZE') * MB
Robert Šefr
@robcza
Aug 15 2018 08:26
This emerged when switching from Alpine to Fedora. It happens on every run when the cache exists prior to execution. If I delete it, the cache is created without any issue, but on restart with the same version and config it fails like this.
Petr Špaček
@pspacek
Aug 15 2018 09:38
This is super weird, I haven't seen this.
I can imagine that two different versions of kresd attempting to use the same cache could cause this, but that's a very wild guess.
Vladimír Čunát
@vcunat
Aug 15 2018 09:43
I can't recall ever seeing
LMDB error: Resource temporarily unavailable
Petr Špaček
@pspacek
Aug 15 2018 09:46
Right, that's another weird thing.
Vladimír Čunát
@vcunat
Aug 15 2018 09:49

man mmap shows:

EAGAIN The file has been locked, or too much memory has been locked (see setrlimit(2)).

I wonder if the container might impose some limits that you don't normally get.
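
One way to check the limit that the mmap man page refers to is to print the locked-memory rlimit from inside the container and compare it with the host. This is only a minimal sketch using Python's standard `resource` module; the helper names are made up for illustration, and nothing here confirms that this limit is the actual cause.

```python
import resource


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK values, in bytes, for this process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("RLIMIT_MEMLOCK soft:", describe_limit(soft))
    print("RLIMIT_MEMLOCK hard:", describe_limit(hard))
```

Running it once on the host and once inside the container shows whether the container imposes a lower limit.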
Vladimír Čunát
@vcunat
Aug 15 2018 09:54

Or closer, it might come from mdb_env_open:

EAGAIN - the environment was locked by another process.

The docs aren't clear on this, but I suppose it might happen if many processes open the same cache at precisely the same moment. We could just, e.g., sleep for a split second and retry (if it's really raised by that function).
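
The sleep-and-retry idea could look roughly like the sketch below. It is written in Python purely for illustration and is not kresd code: `retry_on_eagain`, `open_fn`, `attempts` and `delay` are hypothetical names, and `open_fn` stands in for whatever call actually opens the LMDB environment, assuming that call reports the error as an `OSError` with `errno` set to `EAGAIN`.

```python
import errno
import time


def retry_on_eagain(open_fn, attempts=5, delay=0.05, sleep=time.sleep):
    """Call open_fn(); on OSError with errno EAGAIN, sleep briefly and retry.

    All parameter names are hypothetical. Any other error, or EAGAIN on
    the final attempt, is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return open_fn()
        except OSError as exc:
            # Give up on unrelated errors, or when retries are exhausted.
            if exc.errno != errno.EAGAIN or attempt == attempts:
                raise
            sleep(delay)
```

A fixed short delay keeps the sketch simple; if several processes keep colliding, a randomized delay would be the more robust choice.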

Robert Šefr
@robcza
Aug 15 2018 10:36
The processes actually do open the cache at the very same moment; supervisord does not allow me to start several forks in sequence.
Petr Špaček
@pspacek
Aug 15 2018 11:00
Interestingly enough this does not happen under systemd which starts forks in parallel as well ...
Robert Šefr
@robcza
Aug 15 2018 11:01
Anything I can simulate/test? Run with verbose(true) during startup?
Vladimír Čunát
@vcunat
Aug 15 2018 11:02
I suspect that verbose mode won't provide more related logs in this case.
Robert Šefr
@robcza
Aug 15 2018 11:03
I can introduce a random delay on process start to avoid the race condition and verify that it happens only when the processes start at the very same moment.
Petr Špaček
@pspacek
Aug 15 2018 12:24
That would be a good first step.
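
The random start delay discussed above could be tried with a small wrapper that the supervisor launches in place of the resolver. This is a minimal sketch under stated assumptions: `startup_delay` and `run_with_jitter` are hypothetical names, and the wrapper simply sleeps and then replaces itself with the real command.

```python
import os
import random
import time


def startup_delay(max_delay, rng=random):
    """Pick a delay in seconds, uniformly from the range 0..max_delay."""
    if max_delay < 0:
        raise ValueError("max_delay must be non-negative")
    return rng.uniform(0.0, max_delay)


def run_with_jitter(command, max_delay):
    """Sleep for a random delay, then replace this process with command.

    command is an argv-style list. Using exec (rather than spawning a
    child) means the supervisor keeps tracking the same PID.
    """
    time.sleep(startup_delay(max_delay))
    os.execvp(command[0], command)
```

Calling `run_with_jitter` with the resolver's usual command line and a maximum delay of a second or two should be enough to tell whether the errors only appear when both processes start simultaneously.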