These are chat archives for beniz/deepdetect

14th Dec 2016
cchadowitz-pf
@cchadowitz-pf
Dec 14 2016 17:39 UTC
Hi @beniz - resuming our conversation here. I'm happy to try debugging with gdb, but not sure how it will work given that the issue came up only after extensive processing over 24 hours. What do you suggest?
Emmanuel Benazera
@beniz
Dec 14 2016 17:43 UTC
run dede in gdb and proceed as usual. If it locks again, you can then look at the threads from inside gdb
my hunch is that you should try a recent cpp-netlib and run under gdb as well, two in one
libtool --mode=execute gdb ./dede
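A minimal sketch of that workflow, assuming the libtool-wrapped dede binary and whatever host/port flags you normally pass (placeholders below); once the server hangs, interrupt it and inspect the threads:

  libtool --mode=execute gdb ./dede
  (gdb) run -host 0.0.0.0 -port 8080
  # ... reproduce the load until it locks up, then hit Ctrl-C ...
  (gdb) info threads
  (gdb) thread apply all bt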
cchadowitz-pf
@cchadowitz-pf
Dec 14 2016 17:46 UTC
hmm ok - and you suggest building it as described in https://github.com/beniz/deepdetect/issues/207#issuecomment-256628677?
Emmanuel Benazera
@beniz
Dec 14 2016 18:21 UTC
I would but cannot guarantee this is it. You may want to follow your intuition also. And if you can reproduce, open an issue and we'll move forward from there. In all cases having gdb access when the problem occurs is the way to go.
cchadowitz-pf
@cchadowitz-pf
Dec 14 2016 18:22 UTC
Ok, thanks. I'm rusty with gdb. Is the command libtool --mode=execute gdb ./dede what I'd need to execute for now?
Do I need to set any breakpoints or anything?
Emmanuel Benazera
@beniz
Dec 14 2016 18:28 UTC
that's the command yes, and run as usual
that the prediction threads are locked is not a good sign. Were there any calls that failed before the locking occurred?
Marc Reichman
@marcreichman
Dec 14 2016 18:43 UTC
Hello @beniz, I'm working with @cchadowitz-pf on this. This is obviously a hard one to reproduce and pin down. Do you know if by chance I can attach to an existing instance with gdb to get the state you're looking for if it locks up again?
We didn't see any failed calls in one of the cases before the lock, just valid 200s to the predict API
Emmanuel Benazera
@beniz
Dec 14 2016 18:52 UTC
I think you need to run it from scratch in gdb
unfortunately... AFAIK
Marc Reichman
@marcreichman
Dec 14 2016 18:53 UTC
I took one of the platforms where it happened offline and booted its Docker container into bash, apt-getted the appropriate bits, and I think I'm going to just hammer it from the outside with 4 threads via GNU parallel; hopefully I can bring the issue out
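A sketch of that kind of outside hammering, assuming GNU parallel and curl are available inside the container; the port, service name, and image path below are placeholders:

  #!/bin/bash
  # predict.sh (hypothetical): a single POST to the predict endpoint
  curl -s -o /dev/null -w "%{http_code}\n" -X POST http://localhost:8080/predict \
    -d '{"service":"imageserv","data":["/data/test.jpg"]}'

  # drive it with 4 concurrent callers, 1000 calls total
  seq 1000 | parallel -j 4 -n 0 ./predict.sh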
Emmanuel Benazera
@beniz
Dec 14 2016 18:55 UTC
let's see
are the 4 threads hitting the same service ?
Marc Reichman
@marcreichman
Dec 14 2016 19:00 UTC
Yes, that's the model we're in now. To me, the question at hand is basically SHOULD this work, or have we just been lucky? I know Caffe doesn't like reuse of its nets, etc., and in the past we've written apps which guard against concurrent use. I just don't know what dede does in this regard
Emmanuel Benazera
@beniz
Dec 14 2016 19:01 UTC
it should but it is not efficient: the net is locked by a predict call
so your 4 concurrent calls are in fact processed sequentially
Marc Reichman
@marcreichman
Dec 14 2016 19:02 UTC
Ok, good to know. And is that at the service level? We are contemplating 4 threads to 4 services in one process, or 4 threads to 4 separate processes with 1 service each
Emmanuel Benazera
@beniz
Dec 14 2016 19:02 UTC
CPU is maxed out by each predict call
the way to do it is to increase your batches
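For reference, a sketch of what a batched predict call looks like, assuming an image service; several items in the data array are processed as one batch (service name, port, and paths are placeholders):

  curl -X POST http://localhost:8080/predict -d '{
    "service":"imageserv",
    "parameters":{"output":{"best":1}},
    "data":["/data/img1.jpg","/data/img2.jpg","/data/img3.jpg","/data/img4.jpg"]
  }'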
Marc Reichman
@marcreichman
Dec 14 2016 19:03 UTC
in this particular application that's not good for us unfortunately
Emmanuel Benazera
@beniz
Dec 14 2016 19:03 UTC
if it doesn't fit your application, then you can create four services, hit each with a single client thread, and see if the load is OK
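A sketch of that four-services setup, assuming the same Caffe model repository can be shared read-only across services; names, port, and paths are placeholders:

  # create imageserv0..imageserv3 on the same dede process
  for i in 0 1 2 3; do
    curl -X PUT http://localhost:8080/services/imageserv$i -d '{
      "mllib":"caffe",
      "description":"image classifier",
      "type":"supervised",
      "parameters":{"input":{"connector":"image"},"mllib":{"nclasses":1000}},
      "model":{"repository":"/opt/models/mymodel"}
    }'
  done
  # each client thread then posts its predict calls to its own service name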
Marc Reichman
@marcreichman
Dec 14 2016 19:05 UTC
But from the perspective of thread safety and CPU usage, you see no real difference between 4 services in 1 process and 4 processes with 1 service each?
Emmanuel Benazera
@beniz
Dec 14 2016 19:09 UTC
4 processes (servers) are safer regarding threads but will listen on different ports.
I'm trying to understand where a deadlock could happen with your current usage but don't see it yet
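A sketch of the four-processes variant, assuming dede accepts the usual host/port flags (an assumption; adjust to your build), with each process loading its own copy of the model:

  # four independent dede servers on separate ports (hypothetical paths/flags)
  for p in 8080 8081 8082 8083; do
    ./dede -host 0.0.0.0 -port $p &
  done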
Marc Reichman
@marcreichman
Dec 14 2016 19:09 UTC
I'm about to kick off my 4-threads-on-1-service test; hopefully it'll happen quickly enough
Emmanuel Benazera
@beniz
Dec 14 2016 19:23 UTC
are you running Docker?
Marc Reichman
@marcreichman
Dec 14 2016 19:27 UTC
yes
Marc Reichman
@marcreichman
Dec 14 2016 19:35 UTC
So, I wrote a quick script which runs curl in parallel 6 times, and then I'm running that script in parallel in a thread pool of 4 threads. So far so good, and I may crank up the numbers soon-ish. Pardon my inexperience with gdb, but is there a way to get traces from all threads into an output file easily?
Emmanuel Benazera
@beniz
Dec 14 2016 19:38 UTC
not sure... there's an interactive shell
regarding Docker, I cannot say whether it is qualified for production; at least we as a company would not guarantee it.
though the locking issue should be unrelated
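On the earlier question about dumping traces from all threads to a file, a minimal gdb sketch (standard gdb commands, independent of dede):

  # from the (gdb) prompt once the server appears stuck (Ctrl-C to interrupt it first)
  (gdb) set logging file threads.txt
  (gdb) set logging on
  (gdb) thread apply all bt
  (gdb) set logging off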
cchadowitz-pf
@cchadowitz-pf
Dec 14 2016 20:21 UTC
As an aside, is the TF backend also locked by a predict call? Or is that specific to Caffe for the moment?
Emmanuel Benazera
@beniz
Dec 14 2016 21:31 UTC
TF has the predict lock as well, though TF has no lock on nets within a session. At the moment this is because batches are the best way and it is more secure. But you could remove it and test. It's not a supported feature though, so you would need to comment out the locking and rebuild for that.