These are chat archives for beniz/deepdetect

3rd Nov 2017
rperdon
@rperdon
Nov 03 2017 14:42
I'm planning on training a model within DD using the same images that I used before for my DIGITS/AlexNet model. I have the DD GPU image downloaded, and I have run a dede server on localhost with one folder mapped to models and another mapped to images
I'm following the instructions there
now keep in mind, I'm running it from within a docker interface
curl -X PUT "http://localhost:8080/services/ddanimemodel" -d '{
  "mllib":"caffe",
  "description":"anime classifier",
  "type":"supervised",
  "parameters":{
    "input":{
      "connector":"image",
      "width":227,
      "height":227
    },
    "mllib":{
      "template":"googlenet",
      "nclasses":5
    }
  },
  "model":{
    "templates":"../templates/caffe/",
    "repository":"anime"
  }
}'
I run this to start my model service, I guess (I changed nclasses to 2: anime or not)
Where I am disconnected in my head is how to reconcile my internal folders to DD
rperdon
@rperdon
Nov 03 2017 14:47
I have a "anime source images" folder say its called anime, within which there is Anime and Non-Anime, each folder containing around 15K-20K images of the corresponding type
Emmanuel Benazera
@beniz
Nov 03 2017 14:49
Hi, when using straight curl calls, keep an eye on the server logs for errors
otherwise, use the Python client for remote control over the server via API
rperdon
@rperdon
Nov 03 2017 14:49
Does DD automatically base the classes on the folder structure, so that anything within a given folder is classified as 0 or 1? E.g., could I do dogs, cats, and monkeys and get 0, 1, and 2 as classification sets?
Emmanuel Benazera
@beniz
Nov 03 2017 14:49
look at the examples page, as there are some useful Python script samples there, typically for monitoring output every x seconds
rperdon
@rperdon
Nov 03 2017 14:49
I was wondering how I should prepare my training data
your examples have it downloaded from another source
Emmanuel Benazera
@beniz
Nov 03 2017 14:50
follow tutorials, but basically, one folder per class and that's it
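For illustration, a minimal sketch of the one-folder-per-class layout in Python. The sorted-order label assignment here is an assumption for illustration only; DeepDetect builds its own class-to-label mapping internally:

```python
import os
import tempfile

def class_layout(root):
    """Map each class subfolder of `root` to an integer label.

    Sketch only: DeepDetect derives classes from the subfolders of the
    data directory; the sorted ordering here is an assumption.
    """
    classes = sorted(
        d for d in os.listdir(root)
        if os.path.isdir(os.path.join(root, d))
    )
    return {name: idx for idx, name in enumerate(classes)}

# Example layout: root/Anime/*.jpg and root/Non-Anime/*.jpg
root = tempfile.mkdtemp()
for name in ("Anime", "Non-Anime"):
    os.makedirs(os.path.join(root, name))

print(class_layout(root))  # {'Anime': 0, 'Non-Anime': 1}
```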
rperdon
@rperdon
Nov 03 2017 14:50
that's the confirmation I needed
and subfolders within, are they not counted?
My "class" folders of anime/non-anime also have subfolders; I just want to ensure they are used as well
Emmanuel Benazera
@beniz
Nov 03 2017 14:52
it's a single directory level only, mostly due to ANSI opendir, but we could do recursive, I guess, for huge class directories
we could implement recursive directories; anyone can open an issue as needed
rperdon
@rperdon
Nov 03 2017 14:54
so I will have to tweak my class folder so there are no sub folders then
Emmanuel Benazera
@beniz
Nov 03 2017 14:54
correct
rperdon
@rperdon
Nov 03 2017 14:55
I may need to open that issue then; our source data is hash based
lots of sub-folders
Emmanuel Benazera
@beniz
Nov 03 2017 14:55
use glob in Python and put symlinks at the root for a quick hack
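The glob-plus-symlinks hack suggested above could look roughly like this (a quick sketch, not DeepDetect code; the counter prefix on link names is just one way to avoid clashes between files from different subfolders):

```python
import glob
import os
import tempfile

def flatten_with_symlinks(class_dir):
    """Symlink every .jpg found under nested subfolders of class_dir
    into class_dir itself, so a single-level directory scan sees them.

    Quick-hack sketch for hash-based source trees with many subfolders.
    """
    made = []
    pattern = os.path.join(class_dir, "**", "*.jpg")
    for i, path in enumerate(glob.glob(pattern, recursive=True)):
        link = os.path.join(class_dir, "%06d_%s" % (i, os.path.basename(path)))
        if not os.path.exists(link):
            os.symlink(os.path.abspath(path), link)
            made.append(link)
    return made

# Demo on a throwaway hash-style tree: root/ab/cd/pic.jpg
root = tempfile.mkdtemp()
sub = os.path.join(root, "ab", "cd")
os.makedirs(sub)
open(os.path.join(sub, "pic.jpg"), "w").close()

links = flatten_with_symlinks(root)
print(links)  # one symlink created at the root
```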
rperdon
@rperdon
Nov 03 2017 14:56
thx
rperdon
@rperdon
Nov 03 2017 15:38
does the docker-gpu image automatically work with any gpu?
in the past I have had to use nvidia-docker to get some gpu access
cchadowitz-pf
@cchadowitz-pf
Nov 03 2017 15:40
nvidia-docker is a wrapper to allow the docker container to access gpu resources correctly. you'll still need to run the container via the nvidia-docker wrapper if you plan to use gpu, yeah
rperdon
@rperdon
Nov 03 2017 15:40
ok, restarting my training
I figured it was going a bit slow
cchadowitz-pf
@cchadowitz-pf
Nov 03 2017 15:41
you can also use nvidia-smi to make sure the dede process is accessing the gpu
rperdon
@rperdon
Nov 03 2017 15:42
I see dede in nvidia-smi now
rperdon
@rperdon
Nov 03 2017 15:49
how do I know it's done?
cchadowitz-pf
@cchadowitz-pf
Nov 03 2017 15:55
i haven't actually trained within dd myself, but from @beniz's comment earlier i'd say you may want to look into using the python dd interface to monitor it
rperdon
@rperdon
Nov 03 2017 15:57
Conceptually I'm thinking of making a gui interface that my partners can use to do simple training of their own datasets
Emmanuel Benazera
@beniz
Nov 03 2017 15:58
you can use https://github.com/jolibrain/dd_board; that's what we do ourselves
we should provide more integrated scripts; for now they have not been released
rperdon
@rperdon
Nov 03 2017 16:00
Our partners want to get into machine learning, but their resources are limited (not much access to computer science majors in a lot of their tech groups). My goal is to make the entry into this as simple as possible
This is why DIGITS presented itself as an interesting product
It lacked the ability to do classification queries like DD does
I'll read into the dd_board to get a handle on it better
Emmanuel Benazera
@beniz
Nov 03 2017 16:02
anyways, the server logs report the process and you can get any of the training values through GET /train calls, see https://deepdetect.com/api/#get-information-on-a-training-job
I don't want to comment too much on 'getting into ML without knowledge of it'. ML will be the next commodity on the stack, we have no doubt about it, and that's why we're doing what we're doing. In practice, though, until the moment it really is a commodity, I'd say it's not too good to put models into production without some double check by a CS person.
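A sketch of pulling the job status and measures out of a GET /train response, per the API docs linked above. The sample response here is illustrative, shaped after those docs rather than captured from a real server; a real monitor would poll something like http://localhost:8080/train?service=ddanimemodel&job=1 every few seconds:

```python
def training_summary(resp):
    """Extract the job status string and the measures dict from a
    GET /train JSON response (already decoded to a Python dict)."""
    head = resp.get("head", {})
    measures = resp.get("body", {}).get("measure", {})
    return head.get("status"), measures

# Illustrative sample, not real server output.
sample = {
    "status": {"code": 200, "msg": "OK"},
    "head": {"method": "/train", "job": 1, "status": "running", "time": 74.0},
    "body": {"measure": {"acc": 0.91, "mcll": 0.23, "f1": 0.90}},
}

status, measures = training_summary(sample)
print(status, measures)
```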
rperdon
@rperdon
Nov 03 2017 16:06
I agree that every agency should have on hand some people equipped with CS knowledge in order to get into this, but the budgets and the "material" dealt with do not exactly make for a world-class, attractive job description
The desire to learn is there, which I applaud and fully support. I believe in the technology to do some good, and my superiors agree
rperdon
@rperdon
Nov 03 2017 17:50
Do you know of any utility which can verify that "jpeg" files are actually JPEGs?
Emmanuel Benazera
@beniz
Nov 03 2017 17:51
identify (from ImageMagick)
rperdon
@rperdon
Nov 03 2017 17:56
How well does DD handle truncated, corrupt files in the training folder?
I'm getting "error" when I check the status
rperdon
@rperdon
Nov 03 2017 18:58
I'm using jpeginfo to identify all the broken files; it's more complete
When I run this command, I get an error during the training part
(corrected the port)
 curl -X PUT "http://localhost:9999/services/ddanimemodel" -d '{
   "mllib":"caffe",
   "description":"anime classifier",
   "type":"supervised",
   "parameters":{
     "input":{
       "connector":"image",
       "width":227,
       "height":227
     },
     "mllib":{
       "template":"googlenet",
       "nclasses":2
     }
   },
   "model":{
     "templates":"../templates/caffe/",
     "repository":"/source"
   }
 }'
I ran this as per the tutorial
 curl -X POST "http://localhost:9999/train" -d '{
   "service":"ddanimemodel",
   "async":true,
   "parameters":{
     "mllib":{
       "gpu":true,
       "net":{
         "batch_size":32
       },
       "solver":{
         "test_interval":500,
         "iterations":30000,
         "base_lr":0.001,
         "stepsize":1000,
         "gamma":0.9
       }
     },
     "input":{
       "connector":
       "image",
       "test_split":0.1,
       "shuffle":true,
       "width":224,
       "height":224
     },
     "output":{
       "measure":["acc","mcll","f1"]
     }
   },
   "data":["ddanimemodel"]
 }'
I ran this to start training
I ran jpeginfo to find all "corrupt" jpg files in the folder
I moved them all out
inside source is Anime/Non-Anime
cchadowitz-pf
@cchadowitz-pf
Nov 03 2017 19:54
what does the server log say?
rperdon
@rperdon
Nov 03 2017 19:54
0 bytes
I went into /var/log and checked deepdetect.log
nothing in it
cchadowitz-pf
@cchadowitz-pf
Nov 03 2017 19:55
are you running the server via docker?
rperdon
@rperdon
Nov 03 2017 19:55
as we discussed before, I was able to start Docker and open the port to the system
I can send curl commands to it
the first command returns 201, saying it loaded the service
then the 2nd curl command starts the training
but on checking the job status, it indicates error
cchadowitz-pf
@cchadowitz-pf
Nov 03 2017 19:56
if you find the container ID for your docker container using docker ps, you can tail the stdout of the container with docker logs -f <first few chars of the container ID>
e.g. if my container ID is 0dbf4ad80649 i can tail the stdout with docker logs -f 0db
it'll probably give a bit more info to work with about what's going wrong
rperdon
@rperdon
Nov 03 2017 19:59
curl -X POST "http://localhost:9999/train" -d '{
   "service":"ddanimemodel",
   "async":true,
   "parameters":{
     "mllib":{
       "gpu":true,
       "net":{
         "batch_size":32
       },
       "solver":{
         "test_interval":500,
         "iterations":30000,
         "base_lr":0.001,
         "stepsize":1000,
         "gamma":0.9
       }
     },
     "input":{
       "connector":"image",
       "test_split":0.1,
       "shuffle":true,
       "width":224,
       "height":224
     },
     "output":{
       "measure":["acc","mcll","f1"]
     }
   },
   "data":["/source"]
 }'
I just realized that the last part, data, refers to the directory
So as a quick note: my settings indicate I will go through 34k images 30,000 times?
wondering if I should lower that number...
cchadowitz-pf
@cchadowitz-pf
Nov 03 2017 20:02
just glancing at it quickly, looks like it, roughly, if you have 34k images (10% set aside for testing though, i believe)
rperdon
@rperdon
Nov 03 2017 20:02
Also, jpeginfo does not find truncated files, but DeepDetect doesn't quit if it finds a truncated JPEG
INFO - 20:04:02 - Processed 30000 files.
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
Premature end of JPEG file
INFO - 20:04:09 - Processed 31000 files.
Premature end of JPEG file
Corrupt JPEG data: premature end of data segment
Premature end of JPEG file
Premature end of JPEG file
INFO - 20:04:16 - Processed 32000 files.
INFO - 20:04:23 - Processed 33000 files.
INFO - 20:04:24 - Processed 33056 files.
Premature end of JPEG file
Premature end of JPEG file
INFO - 20:04:24 - Opened lmdb /source/test.lmdb
Corrupt JPEG data: 5749 extraneous bytes before marker 0xd9
I also hit Ctrl-C, which stopped me after 1 iteration
I will lower that to maybe 50 iterations
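Since jpeginfo missed some of the truncated files above, a minimal stdlib heuristic can catch many of them: a well-formed JPEG starts with the SOI marker (FF D8) and ends with the EOI marker (FF D9). This is only a sketch, not a full decode, and some valid JPEGs carry trailing bytes after EOI, so treat failures as candidates for inspection:

```python
import os
import tempfile

def jpeg_looks_complete(path):
    """Heuristic: check for the JPEG SOI marker at the start of the
    file and the EOI marker at the end. Catches many truncated files;
    not a substitute for fully decoding the image."""
    if os.path.getsize(path) < 4:
        return False
    with open(path, "rb") as f:
        if f.read(2) != b"\xff\xd8":
            return False
        f.seek(-2, os.SEEK_END)
        return f.read(2) == b"\xff\xd9"

# Demo with two synthetic files: one complete, one truncated.
d = tempfile.mkdtemp()
ok = os.path.join(d, "ok.jpg")
bad = os.path.join(d, "bad.jpg")
with open(ok, "wb") as f:
    f.write(b"\xff\xd8" + b"\x00" * 16 + b"\xff\xd9")
with open(bad, "wb") as f:
    f.write(b"\xff\xd8" + b"\x00" * 16)  # no EOI marker

print(jpeg_looks_complete(ok), jpeg_looks_complete(bad))  # True False
```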
cchadowitz-pf
@cchadowitz-pf
Nov 03 2017 20:04
cool - looks like you got the stdout working too
rperdon
@rperdon
Nov 03 2017 20:10
I think with that I am done for the day
I will leave it running for a few hundred iterations
cchadowitz-pf
@cchadowitz-pf
Nov 03 2017 20:12
good luck! :)
Emmanuel Benazera
@beniz
Nov 03 2017 20:26
Don't use stepsize or gamma here for now. Iterations are the number of times a batch is sent; 30k is fine. The images are all put into an LMDB database before training starts; that's what you are seeing in the logs. Just let it work and training will unfold.
The corrupt images will be ignored.
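The arithmetic behind the "34k images 30,000 times?" question, as a sketch. The figures are the ones discussed in this chat (roughly 34k images, batch size 32, test_split 0.1, 30,000 iterations); one iteration sends one batch, so the pass count over the training set is much lower than 30,000:

```python
# Rough epoch math for the training settings discussed above (assumed figures).
total_images = 34_000
test_split = 0.1
batch_size = 32
iterations = 30_000

train_images = round(total_images * (1 - test_split))  # images kept for training
samples_seen = iterations * batch_size                 # one batch per iteration
epochs = samples_seen / train_images                   # passes over the training set

print(train_images, samples_seen, round(epochs, 1))
```

So 30,000 iterations works out to roughly 31 passes over the training set, not 30,000.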
rperdon
@rperdon
Nov 03 2017 20:27
/opt/deepdetect/src/caffelib.cc:785: exception while forward/backward pass through the network
what does this mean?
also, remove stepsize and gamma right now?
Emmanuel Benazera
@beniz
Nov 03 2017 20:27
GPU memory or another error
rperdon
@rperdon
Nov 03 2017 20:28
I have 16 gb available...
for gpu mem
Emmanuel Benazera
@beniz
Nov 03 2017 20:28
Something else then
Use 224 as image size not 227
rperdon
@rperdon
Nov 03 2017 20:29
ok
Emmanuel Benazera
@beniz
Nov 03 2017 20:31
Delete what's in your source dir
rperdon
@rperdon
Nov 03 2017 20:32
I cleared all the files before rerunning
I corrected to 224 for both
and removed the gamma and stepsize
Emmanuel Benazera
@beniz
Nov 03 2017 20:33
Use async: false if you do not want to lose control of the job, at least for debug purposes
rperdon
@rperdon
Nov 03 2017 20:34
should I stop it and re-run again?
for the async change?
Emmanuel Benazera
@beniz
Nov 03 2017 20:34
No, see whether it errors again
rperdon
@rperdon
Nov 03 2017 20:35
So is that really 30k iterations?
Also same error again
INFO - 20:37:31 - Solver scaffolding done.[20:37:31] /opt/deepdetect/src/caffelib.cc:785: exception while forward/backward pass through the network
Emmanuel Benazera
@beniz
Nov 03 2017 20:38
what's your GPU again?
try gpu:false for debug purposes
rperdon
@rperdon
Nov 03 2017 20:46
GPU is a Tesla P100
16gb
Emmanuel Benazera
@beniz
Nov 03 2017 20:47
try with gpu:false, if it works, we'll look at your build
the CUDA compute capability for the P100 is 6.0
rperdon
@rperdon
Nov 03 2017 20:48
I think I was running the minimum compatible CUDA version for the P100
Emmanuel Benazera
@beniz
Nov 03 2017 20:50
I don't think the docker builds or 'regular' Caffe builds include 6.0
have you built your own docker?
rperdon
@rperdon
Nov 03 2017 20:50
I was running nvidia-docker on your DeepDetect GPU build
ERROR - 20:52:26 - service ddanimemodel training call failed
I guess I will try gpu false now
Emmanuel Benazera
@beniz
Nov 03 2017 20:51
thanks
rperdon
@rperdon
Nov 03 2017 20:52
{"status":{"code":500,"msg":"InternalError","dd_code":1007,"dd_msg":"src/caffe/util/im2col.cu:61 / Check failed (custom): (error) == (cudaSuccess)"}}
looks like a cuda error
Emmanuel Benazera
@beniz
Nov 03 2017 20:53
I'll rebuild the docker with support for 6.0
rperdon
@rperdon
Nov 03 2017 20:53
isn't CUDA at version 8?
Emmanuel Benazera
@beniz
Nov 03 2017 20:54
try the cpu docker version if you like in order to make sure everything else is fine, then you'll be able to switch to a fresh one
it's not CUDA, it's the compute capability number of the cards
rperdon
@rperdon
Nov 03 2017 20:54
I will try that out
ah k
rperdon
@rperdon
Nov 03 2017 20:55
if this makes any sense, I may be moving on to tesla v100's by Jan
I wonder if it's still compute version 6 on those
I appreciate the assistance, I will leave this running over the weekend
Emmanuel Benazera
@beniz
Nov 03 2017 20:58
that's OK, the docker image will be rebuilt with support for 6.0 by then
rperdon
@rperdon
Nov 03 2017 21:00
gpu false works
it's the CUDA compute version
Emmanuel Benazera
@beniz
Nov 03 2017 21:01
OK, we'll try that then
rperdon
@rperdon
Nov 03 2017 21:01
This network produces output probt
INFO - 20:59:46 - Network initialization done.
INFO - 20:59:46 - Solver scaffolding done.[21:00:24] /opt/deepdetect/src/caffelib.cc:812: smoothed_loss=10.5908
So I anticipate this will take a very long time on CPU