These are chat archives for beniz/deepdetect

28th Jun 2018
cchadowitz-pf
@cchadowitz-pf
Jun 28 2018 17:28
Thanks again for merging in #425 (and catching that typo in ut_dlibapi linking). Out of curiosity, have you run into any situations where CUDA memory management has become problematic due to the threading of the various services? More specifically, if you have 2+ services using the same GPU (like 2 Caffe services, for example), have you noticed anything odd with GPU memory usage? I'm still working off a hunch, so this is fishing in the dark a bit...
Emmanuel Benazera
@beniz
Jun 28 2018 17:32
No pb. Never. We stack many models on a single GPU, but in practice we don't hammer them too much. What are the problems you are running into? It's notoriously hard to debug NVIDIA drivers and GPU memory issues.
cchadowitz-pf
@cchadowitz-pf
Jun 28 2018 17:36
Hah yeah. Well, currently I'm seeing it with dlib models; I haven't yet tried to replicate it with Caffe or TF models. But basically, if I create two dlib model services on the same GPU and make a handful of prediction calls with varying batch sizes, it doesn't seem to release the GPU memory consistently, so over time it throws a CUDA out-of-memory error for a batch size that had previously worked. I.e. if batches up to 50 worked initially, after some period of varying the batch sizes the memory usage fluctuates up and down but trends upwards overall, so later on that same call with a batch of 50 runs out of memory. I currently have a ticket open with dlib to look into it, but I'm so far unable to replicate it in a self-contained app (though still using the opencv+dlib libs that DeepDetect is using).
So I was thinking that maybe the fact that in DeepDetect the services are threaded may play into it somehow, as my self-contained app is not using multiple threads.
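For reference, a minimal sketch of the kind of setup being described, assuming a DeepDetect server at localhost:8080; the service names, model repositories, and dlib mllib parameters are placeholders and may differ from the actual backend options:

```python
import requests

DD = "http://localhost:8080"  # assumed DeepDetect endpoint

# Create two dlib-backed services sharing the same GPU (gpuid 0).
# Names, repositories, and mllib parameters are placeholders; the dlib
# backend may expect slightly different options.
for name, repo in [("dlib_a", "/models/dlib_a"), ("dlib_b", "/models/dlib_b")]:
    requests.put(DD + "/services/" + name, json={
        "mllib": "dlib",
        "description": "dlib test service",
        "type": "supervised",
        "parameters": {
            "input": {"connector": "image"},
            "mllib": {"gpu": True, "gpuid": 0},
        },
        "model": {"repository": repo},
    })

def predict(service, images):
    """One predict call with a batch of image paths/URLs; repeating this
    with varying batch sizes is what seems to make GPU memory creep up."""
    return requests.post(DD + "/predict", json={
        "service": service,
        "parameters": {"output": {}},  # fill in output params for your model type
        "data": images,
    }).json()
```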
Emmanuel Benazera
@beniz
Jun 28 2018 17:37
Backends other than Caffe, like TF, use a default mode of taking over the full GPU memory.
Your issue sounds different, but it's possible that dlib assumes that allocated memory is for reuse.
When we stack models on GPUs we tend to fix the batch size of each model
If only to not fail some of the calls
cchadowitz-pf
@cchadowitz-pf
Jun 28 2018 17:38
From my experience, TF greedily takes as much GPU memory as possible on init, yeah, but Caffe seems to behave more like dlib, allocating as much memory as required on demand (an initial amount for the model, then any additional for the batched images).
Emmanuel Benazera
@beniz
Jun 28 2018 17:39
TF has a grow memory mode we use
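(This presumably refers to TensorFlow's allow_growth option; a minimal TF 1.x sketch of the mode itself, independent of how DeepDetect configures it internally:)

```python
import tensorflow as tf  # TF 1.x API, contemporary with this discussion

config = tf.ConfigProto()
# Instead of reserving nearly all GPU memory when the session is created,
# start small and grow the allocation on demand.
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```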
cchadowitz-pf
@cchadowitz-pf
Jun 28 2018 17:39
So varying the batch sizes exacerbates the issue and lets me replicate it more quickly, but the trend of memory usage climbing is visible even with the same batch size; it just takes much, much longer to become apparent, I think.
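One way to make that trend visible is to sample used GPU memory between identical predict calls; a minimal sketch assuming nvidia-smi is available on the PATH (the predict call is a hypothetical placeholder):

```python
import subprocess
import time

def gpu_mem_used_mib(gpu_index=0):
    """Used GPU memory in MiB, as reported by nvidia-smi."""
    out = subprocess.check_output([
        "nvidia-smi", "-i", str(gpu_index),
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip().splitlines()[0])

# Sample between identical predict calls: with the same image and a fixed
# batch size, a steady upward drift here is the trend being described.
for i in range(200):
    # predict("dlib_a", ["/data/test.jpg"] * 50)  # hypothetical helper
    print(i, gpu_mem_used_mib(0))
    time.sleep(1)
```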
Emmanuel Benazera
@beniz
Jun 28 2018 17:39
Let us know how it goes with Caffe.
It's possible you may have to pre-fill your batches before hitting dd.
cchadowitz-pf
@cchadowitz-pf
Jun 28 2018 17:40
Pre-fill how exactly?
Emmanuel Benazera
@beniz
Jun 28 2018 17:41
I mean ensure a fixed batch size
But if dlib grows memory even with a fixed batch size, there's an issue somewhere.
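Client-side, "pre-filling" could look something like the sketch below: pad every request up to a single fixed batch size (here by repeating the last image) before hitting dd, so the backend always sees the same batch shape. The helper name and the batch size of 50 are just illustrative.

```python
def pad_to_fixed_batch(images, batch_size=50):
    """Pad a short batch up to a fixed size so every predict call presents
    the same batch shape to the backend (padding results are discarded)."""
    if not images:
        raise ValueError("empty batch")
    if len(images) > batch_size:
        raise ValueError("batch larger than the fixed size")
    padded = list(images)
    while len(padded) < batch_size:
        padded.append(images[-1])  # repeat the last image as filler
    return padded

# e.g. predict("dlib_a", pad_to_fixed_batch(current_images, 50))
# and ignore the results for the padded entries.
```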
cchadowitz-pf
@cchadowitz-pf
Jun 28 2018 17:42
Heh yeah. Just strange that I'm having trouble replicating this outside of DeepDetect services. It's possible that the dlib integration has an issue somewhere...
Emmanuel Benazera
@beniz
Jun 28 2018 17:42
I'm pretty certain this does not happen with Caffe; we have GPU jobs up for 8+ months.
Does it happen with a single service only?
cchadowitz-pf
@cchadowitz-pf
Jun 28 2018 17:44
Doesn't seem to, but I should try one service with much larger batch sizes to see for sure. I have logging messages in the dlib lib now to indicate when cudaMalloc and cudaFree are being called, and it does seem that the cudaFree calls aren't necessarily happening immediately after the prediction is complete (which isn't necessarily bad...). But maybe the memory isn't being reused properly or something. Let me see if I can get it to happen with one service.
Emmanuel Benazera
@beniz
Jun 28 2018 17:45
Yes, you may not want to deallocate too quickly.
I'm not sure Caffe deallocates; rather, it reallocates instead.
cchadowitz-pf
@cchadowitz-pf
Jun 28 2018 17:47
hmm
Emmanuel Benazera
@beniz
Jun 28 2018 17:47
Detection models produce a variable number of bboxes and memory usage may vary slightly even with a fixed batch size.
cchadowitz-pf
@cchadowitz-pf
Jun 28 2018 17:48
Well, for my tests I'm using a single image batched multiple times, so the bboxes should be stable in that situation.
Emmanuel Benazera
@beniz
Jun 28 2018 17:48
Ok smart
cchadowitz-pf
@cchadowitz-pf
Jun 28 2018 17:48
And the bboxes and general memory usage shouldn't affect GPU memory, I imagine?
Emmanuel Benazera
@beniz
Jun 28 2018 17:50
In Caffe I don't remember exactly, but there's a pruning part at the end that possibly reshapes the GPU memory output. I'd have to look. I was just mentioning it in case dlib does something similar and the memory would not be released.
cchadowitz-pf
@cchadowitz-pf
Jun 28 2018 17:51
Hmm, interesting. Not sure if dlib does something similar or not...
cchadowitz-pf
@cchadowitz-pf
Jun 28 2018 17:57
Cool, I was able to trigger it with a single service, so I think that rules out any weird threading+CUDA issues...
It just meant the batch size had to be larger.
Essentially my GPU can handle a batch size of about 148 for the particular image I'm using before running out of memory, so I start with a few predict calls at that batch size, then alternate the batch size between 1 and 148 a few times, then do a few batch-size-1 calls, then a sequence of predicts with steadily increasing batch sizes (1, 2, 10, 20, 50, 100, 148), and this time around the 148 runs out of memory.
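Roughly that sequence as a sketch, reusing the hypothetical predict() helper from earlier; the image path, service name, and 148 ceiling are specific to this setup:

```python
MAX_BATCH = 148                 # what this particular GPU/image can just fit
IMG = "/data/test.jpg"          # single test image, repeated to form batches
SVC = "dlib_a"                  # hypothetical service name

def run(batch_size):
    predict(SVC, [IMG] * batch_size)

# 1) a few calls at the maximum working batch size
for _ in range(3):
    run(MAX_BATCH)

# 2) alternate between batch size 1 and the maximum a few times
for _ in range(3):
    run(1)
    run(MAX_BATCH)

# 3) a few batch-size-1 calls
for _ in range(3):
    run(1)

# 4) steadily increasing batch sizes -- the final 148 now OOMs,
#    even though the same size succeeded at the start
for bs in (1, 2, 10, 20, 50, 100, 148):
    run(bs)
```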