dgtlmoon
@dgtlmoon
[2022-02-04 16:06:58.042] [torchlib] [info] Initializing net from parameters: 
[2022-02-04 16:06:58.042] [torchlib] [info] Creating layer / name=tdata / type=AnnotatedData
[2022-02-04 16:06:58.042] [torchlib] [info] Creating Layer tdata
[2022-02-04 16:06:58.042] [torchlib] [info] tdata -> data
[2022-02-04 16:06:58.042] [torchlib] [info] tdata -> label
terminate called after throwing an instance of 'CaffeErrorException'
  what():  ./include/caffe/util/db_lmdb.hpp:15 / Check failed (custom): (mdb_status) == (0)
 0# dd::OatppJsonAPI::abort(int) at /opt/deepdetect/src/oatppjsonapi.cc:255
 1# 0x00007F24AC053210 in /lib/x86_64-linux-gnu/libc.so.6
 2# raise in /lib/x86_64-linux-gnu/libc.so.6
 3# abort in /lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F24AC477911 in /lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F24AC48338C in /lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F24AC4833F7 in /lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F24AC4836A9 in /lib/x86_64-linux-gnu/libstdc++.so.6
 8# caffe::db::MDB_CHECK(int) at ./include/caffe/util/db_lmdb.hpp:15
 9# caffe::db::LMDB::Open(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, caffe::db::Mode) at src/caffe/util/db_lmdb.cpp:40
10# caffe::DataReader<caffe::AnnotatedDatum>::Body::InternalThreadEntry() at src/caffe/data_reader.cpp:95
11# 0x00007F24B0DC343B in /lib/x86_64-linux-gnu/libboost_thread.so.1.71.0
12# start_thread in /lib/x86_64-linux-gnu/libpthread.so.0
13# __clone in /lib/x86_64-linux-gnu/libc.so.6

Aborted (core dumped)

current jolibrain/deepdetect_cpu with https://www.deepdetect.com/downloads/platform/pretrained/ssd_300/VGG_rotate_generic_detect_v2_SSD_rotate_300x300_iter_115000.caffemodel

used to work before I think.. checking the GPU version

hmm, maybe related to a permissions issue in the model dir? src/caffe/util/db_lmdb.cpp / LMDB related.. maybe.. checking
x/detection$ ls -al model/train.lmdb/
total 29932
drwxr--r-- 2 dgtlmoon dgtlmoon     4096 Feb  4 17:10 .
drwxrwxrwx 3 dgtlmoon dgtlmoon     4096 Feb  4 17:10 ..
-rw-r--r-- 1 dgtlmoon dgtlmoon 30638080 Feb  4 17:10 data.mdb
-rw-r--r-- 1 dgtlmoon dgtlmoon     8192 Feb  4 17:10 lock.mdb
looks at least like it's writing, I'm nuking that dir when I create the service, so these are always fresh
Emmanuel Benazera
@beniz
@dgtlmoon it doesn't matter if some objects appear only in some images, if that is your question
make sure you have write permissions overall, since this is being accessed by the docker image.
dgtlmoon
@dgtlmoon
@beniz I've been doing docker stuff for many years now, that was the first thing I checked :)
dd@5c643a8f4b39:/tags_dataset/bottom/model$ mkdir foobar
dd@5c643a8f4b39:/tags_dataset/bottom/model$ touch foobar/ok.txt
dd@5c643a8f4b39:/tags_dataset/bottom/model$
I can see that train.lmdb/data.mdb is created without problem brand new every time, and THEN the segfault happens
the service setup works fine, then I call the train method; I see the MDB is created every time, and I get the segfault

curl -X POST "http://localhost:8080/train" -d '
{
  "service": "location",
  "async": true,
  "parameters": {
    "input": {
      "db": true,
      "db_width": 512,
      "db_height": 512,
           "width": 300,
           "height": 300

    },
    "mllib": {
      "resume": false,
      "net": {
        "batch_size": 20,
        "test_batch_size": 12
      },
      "solver": {
        "iterations": 50000,
        "test_interval": 500,
        "snapshot": 1000,
        "base_lr": 0.0001
      },
      "bbox": true
    },
    "output": {
      "measure": [
        "map"
      ]
    }
  },
  "data": [ "/tags_dataset/bottom/bottom-images.txt" ]
}
'
dgtlmoon
@dgtlmoon
I even chmod 777'ed the model dir just to be safe
dgtlmoon
@dgtlmoon
hmm ok, tried some earlier tags like v0.15.0 and v0.17.0 which I remember worked, same error, so I'm doing something weird here somehow
dgtlmoon
@dgtlmoon

got it.. needs a check added in the code I think

Forgot to add both the test AND train lists; passing only

  "data": [ "/tags_dataset/bottom/train.txt" ]

will cause the crash
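
For reference, a sketch of what the corrected field could look like (the test.txt path is just a guess at the naming; the second list is presumably used as the test set):

  "data": [ "/tags_dataset/bottom/train.txt", "/tags_dataset/bottom/test.txt" ]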

Emmanuel Benazera
@beniz
ah thanks, this should be caught by unit tests, and an error returned instead.
dgtlmoon
@dgtlmoon
done :)
dgtlmoon
@dgtlmoon

I love it when you get simple problems to solve that go to amazing accuracy in a few iterations

"map_hist": [ 0.5956036100784937, 0.8226678265879551, 0.9506028971324364, 0.9687120678524176 ] :)

dgtlmoon
@dgtlmoon
I tried using a single file of images to train on, and added "parameters": { "input": { "test_split": 0.10 } } to both the service setup and service train calls, but I still get that same segfault
Emmanuel Benazera
@beniz
object detector training with caffe has no test_split
dgtlmoon
@dgtlmoon
x)
Emmanuel Benazera
@beniz
it's going to be deprecated soon, in favor of the torch backend for object detection. At the moment, split by hand and provide a test.txt file; the fix has been merged.
dgtlmoon
@dgtlmoon
ok I usually use caffe for absolutely no reason other than that's what's in the examples, maybe I should try torch? (trying the squeezenet object detector here)
yes, I have a python script I wrote to split stuff for me
that's funny, so actually the bug I found was completely unrelated to what I was doing :D
dgtlmoon
@dgtlmoon
I wonder if you can run a classifier with just one class, I'll try, I think so
Nur
@nurkbts
Hi
How can I use the code for an AI server? Is there an easy setup?
Marc Reichman
@marcreichman
Hello @beniz or anyone. Quick question - for a pretrained scenario with just classifications, is it possible to run in read-only docker? I am seeing a message couldn't write model.json file in model repository... when loading, but classifications seem to work fine. Can this message be disregarded in this case?
Emmanuel Benazera
@beniz
Hey @marcreichman yes, I believe you can ignore it.
Marc Reichman
@marcreichman
Thanks @beniz !
cchadowitz-pf
@cchadowitz-pf
Hi @beniz, long time no chat! I'm working on integrating some torch models into our DD configuration and ran into the following error. This is a traced pytorch model that I've successfully used (in a test environment) with LibTorch, using OpenCV to load the image file into a cv::Mat, which then required the color channels as well as the dimensions to be permuted to match what libtorch expected. I imagine the error I'm getting from DD is related to something similar, where the dimensions aren't lining up for some reason between what the torch dataloader provides to the model and what the model expects. Does that seem like the case to you? If so, do you have any suggestions on how to tweak things to work around it?
mllib internal error: Libtorch error:The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__.py", line 12, in forward
    model = self.model
    transforms = self.transforms
    input = torch.unsqueeze((transforms).forward(x, ), 0)
                             ~~~~~~~~~~~~~~~~~~~ <--- HERE
    return (model).forward(input, )
  File "code/__torch__/torch/nn/modules/container.py", line 12, in forward
    _1 = getattr(self, "1")
    _0 = getattr(self, "0")
    return (_1).forward((_0).forward(x, ), )
                         ~~~~~~~~~~~ <--- HERE
  File "code/__torch__/torchvision/transforms/transforms.py", line 10, in forward
    img = torch.unsqueeze(x, 0)
    input = torch.to(img, 6)
    img0 = torch.upsample_bilinear2d(input, [299, 299], False, None)
           ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    img1 = torch.squeeze(img0, 0)
    img2 = torch.round(img1)

Traceback of TorchScript, original code (most recent call last):
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py(3919): interpolate
/usr/local/lib/python3.7/dist-packages/torchvision/transforms/functional_tensor.py(490): resize
/usr/local/lib/python3.7/dist-packages/torchvision/transforms/functional.py(438): resize
/usr/local/lib/python3.7/dist-packages/torchvision/transforms/transforms.py(349): forward
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py(1098): _slow_forward
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py(1110): _call_impl
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py(141): forward
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py(1098): _slow_forward
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py(1110): _call_impl
<ipython-input-3-4afab89cb121>(19): forward
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py(1098): _slow_forward
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py(1110): _call_impl
/usr/local/lib/python3.7/dist-packages/torch/jit/_trace.py(965): trace_module
/usr/local/lib/python3.7/dist-packages/torch/jit/_trace.py(750): trace
<ipython-input-10-1646f9ee17ed>(5): <module>
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py(2882): run_code
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py(2822): run_ast_nodes
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py(2718): run_cell
/usr/local/lib/python3.7/dist-packages/ipykernel/zmqshell.py(537): run_cell
/usr/local/lib/python3.7/dist-packages/ipykernel/ipkernel.py(208): do_execute
/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py(399): execute_request
/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py(233): dispatch_shell
/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py(283): dispatcher
/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py(300): null_wrapper
/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py(556): _run_callback
/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py(606): _handle_recv
/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py(577): _handle_events
/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py(300): null_wrapper
/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py(122): _handle_events
/usr/lib/python3.7/asyncio/events.py(88): _run
/usr/lib/python3.7/asyncio/base_events.py(1786): _run_once
/usr/lib/python3.7/asyncio/base_events.py(541): run_forever
/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py(132): start
/usr/local/lib/python3.7/dist-packages/ipykernel/kernelapp.py(499): start
/usr/local/lib/python3.7/dist-packages/traitlets/config/application.py(846): launch_instance
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py(16): <module>
/usr/lib/python3.7/runpy.py(85): _run_code
/usr/lib/python3.7/runpy.py(193): _run_module_as_main
RuntimeError: It is expected input_size equals to 4, but got size 5
for comparison, in my small libtorch test environment where I successfully used this model, I used the following:
        #include <torch/script.h>     // torch::jit::load / script::Module
        #include <opencv2/opencv.hpp> // cv::imread, cv::cvtColor

        // load the traced TorchScript module given on the command line
        torch::jit::script::Module module;
        module = torch::jit::load(argv[1]);

        // read the image and convert OpenCV's default BGR order to RGB
        cv::Mat img = cv::imread(argv[2]);
        cv::cvtColor(img, img, cv::COLOR_BGR2RGB);

        // wrap the pixel buffer as an HWC uint8 tensor, cast to float, permute to CHW
        at::Tensor tensor_image = torch::from_blob(img.data, { img.rows, img.cols, img.channels() }, at::kByte);
        tensor_image = tensor_image.to(at::kFloat);
        tensor_image = tensor_image.permute({ 2, 0, 1 });

        // forward a single unbatched [3, H, W] tensor; the traced model adds the batch dim itself
        at::Tensor output = module.forward({tensor_image}).toTensor();
Emmanuel Benazera
@beniz
Certainly the permutation, cc @Bycob
cchadowitz-pf
@cchadowitz-pf
I've been trying to build DD with torchlib.cc changed to pass a tensor to the .forward() method the way my minimal working example (outside DD) does, but I haven't yet been able to reconcile the two and identify what the issue is on the DD side....
Louis Jean
@Bycob
Images are loaded here in DD https://github.com/jolibrain/deepdetect/blob/master/src/backends/torch/torchdataset.cc#L890
It looks like we do the same operations as you
Maybe you can try to log the dimensions of the input tensor to see if it's similar to the one in your example?
cchadowitz-pf
@cchadowitz-pf
would it make sense to do that around here? https://github.com/jolibrain/deepdetect/blob/master/src/backends/torch/torchlib.cc#L1462
something like just outputting tensor.dim()?
Louis Jean
@Bycob
Yes, that way you will see exactly what the model is taking as input
cchadowitz-pf
@cchadowitz-pf
so at that point I'm getting dim 4, I'm going to also try to output in_vals[0].dim() at https://github.com/jolibrain/deepdetect/blob/master/src/backends/torch/torchlib.cc#L1472
Louis Jean
@Bycob
The traced module is called here https://github.com/jolibrain/deepdetect/blob/master/src/backends/torch/torchmodule.cc#L315
.dim() gives you the number of dimensions, you can call .sizes() to get all the dimensions
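
For reference, a minimal sketch of that kind of logging (the helper name is made up; it assumes source is the std::vector<c10::IValue> handed to forward()):

#include <torch/torch.h>
#include <iostream>
#include <vector>

// made-up debug helper: print the rank and full shape of the first input tensor
void log_input_shape(const std::vector<c10::IValue> &source)
{
  at::Tensor t = source[0].toTensor();
  std::cout << "input dim():   " << t.dim() << std::endl;   // number of dimensions, e.g. 4
  std::cout << "input sizes(): " << t.sizes() << std::endl; // full shape, e.g. [1, 3, 299, 299]
}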
cchadowitz-pf
@cchadowitz-pf
ah okay. and since source there is a vector of IValues, if I'm only testing with a single input image it still makes sense to do something like source[0].toTensor().sizes(), right?
Louis Jean
@Bycob
yes
cchadowitz-pf
@cchadowitz-pf
traced module forward - source[0].toTensor().dim(): 4
traced module forward - source[0].toTensor().sizes(): [1, 3, 299, 299]
so for some reason there are 4 dims for the first (and only, in this case) image in the source std::vector
Louis Jean
@Bycob
The first dimension is the batch size
cchadowitz-pf
@cchadowitz-pf
and I imagine the error is because then source has dim 5?
I thought source was the batch, not source[0]?
Louis Jean
@Bycob
It can be a bit confusing, sometimes models have multiple tensors as input, so source contains each argument as a batched tensor
For example with detection models training you need to pass the images and the labels, so source[0] contains all the images of the batch and source[1] contains all the labels
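
For illustration, a rough sketch of that layout (the shapes are invented, not what DD actually builds):

#include <torch/torch.h>
#include <vector>

// invented example: each forward() argument is a batched tensor packed into the IValue vector
std::vector<c10::IValue> make_detection_inputs()
{
  std::vector<c10::IValue> source;
  source.push_back(torch::rand({8, 3, 300, 300})); // source[0]: all 8 images of the batch
  source.push_back(torch::zeros({8, 1, 5}));       // source[1]: labels/boxes, one row per image (invented shape)
  return source;                                   // module.forward(source) then sees each argument already batched
}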
cchadowitz-pf
@cchadowitz-pf
ah right
so in my minimal working example, I'm doing something like at::Tensor output = module.forward({tensor_image}).toTensor(); where tensor_image has size [3, 299, 299], and so .forward() is being passed a vector where the first and only element is [3, 299, 299]
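
If that's what's happening, the mismatch could be sketched like this (a guess based on the traced transforms above, which call unsqueeze themselves):

#include <torch/torch.h>
#include <cassert>

// guess at the mismatch: the traced forward() unsqueezes its input itself,
// so a tensor that is already batched ends up 5-d by the time the resize runs
void shape_mismatch_sketch()
{
  at::Tensor unbatched = torch::rand({3, 299, 299});    // what the minimal test harness passes
  at::Tensor batched   = torch::rand({1, 3, 299, 299}); // what DD appears to pass

  assert(unbatched.unsqueeze(0).dim() == 4); // fine for upsample_bilinear2d
  assert(batched.unsqueeze(0).dim() == 5);   // matches "expected input_size equals to 4, but got size 5"
}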