elbamos
@elbamos
I'm trying to continue an experiment. I got to 2000 epochs training with accUpdate, and now I'd like to continue but with accUpdate turned off so I can experiment with momentum and gradient clipping. When I do that, I get this error:
/usr/local/share/lua/5.1/nn/Linear.lua:99: invalid arguments: CudaTensor number CudaTensor CudaTensor expected arguments: *CudaTensor~2D* [CudaTensor~2D] [float] CudaTensor~2D CudaTensor~2D | *CudaTensor~2D* float [CudaTensor~2D] float CudaTensor~2D CudaTensor~2D stack traceback:
It comes out of module:backward() and, tracing back, seems to imply that gradInput is of the wrong type. I'm using the stock callback function (mostly). Can anyone suggest how to track this down? I'm trying to avoid spending a day diving into the dirty bits of where dp, dpnn, and nn intersect.
Soumith Chintala
@soumith
@elbamos ouch. you can just go into a debugger, see at which layer this occurs, and check what the buffers coming in before and after are. I recommend mobdebug https://github.com/pkulchenko/MobDebug , really simple to use and to set breakpoints with.
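A minimal sketch of the setup (assumes MobDebug's default localhost:8172):

-- terminal 1: start a standalone listener (or use ZeroBrane Studio instead)
--   $ lua -e "require('mobdebug').listen()"

-- in the training script, just before the failing call:
require('mobdebug').start()  -- connects to the listener and pauses here
-- from the debugger prompt you can step into module:backward() and check
-- gradInput:type() layer by layer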
elbamos
@elbamos
hey thanks! I'd given up on using torch with a real debugger
Lior Uzan
@ghostcow
@elbamos did you remove the nn.Convert() layer or something? I had similar issues when I removed mine by accident
it probably has nothing to do with it though, because my trouble was with the forward pass.
elbamos
@elbamos
@ghostcow I did not. I suspect it may relate to a change I made in using Serial layers. Either way, I was able to jumpstart the network into starting to learn again -- now at epoch 2660, the validation loss has dropped by 30%, and I have some pathways to keep it learning for another 1500 epochs or so if it starts to stall again. So the emergency has passed :)
I will say, though, that I think it's dpnn: something in the framework gobbles up a lot of GPU RAM if you turn off in-place updates.
I should not be blowing out RAM on a network with three inception layers on a 12GB Titan just because I turn momentum on.
cjviper
@cjviper
hi - on the subject of memory issues: when training CNNs using the convolutionneuralnetwork script from dp, at the end of an epoch I'm noticing big jumps in memory usage on the GPU. It starts out at ~350MB during the epoch, then as soon as the first epoch completes it jumps to over 2GB. Before I go digging, does anyone know what happens at the end of an epoch that could cause such a large spike in memory usage? I know the validation set is evaluated at the end of an epoch, but I tested with a much-reduced validation set of only a handful of images, and still got the same memory jump.
elbamos
@elbamos
@cjviper that's what I'm seeing as well (not the same model though). I traced it to the callback. If I switch to accUpdate, the issue goes away.
cjviper
@cjviper
@elbamos what are you training? CNN?
elbamos
@elbamos
accUpdate should be more memory efficient, but I'm seeing about 4x the GPU RAM consumption with accUpdate off.
yeah I have cnn modules, but I don't know that that's it. I'm using cudnn so dp isn't providing the cnn code.
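For context, a sketch of the two update paths in plain nn on a toy module (as I understand it, dp drives this through its callback, but these are the underlying nn calls):

require 'nn'

local model = nn.Linear(10, 2)        -- toy stand-in for the real network
local input = torch.randn(10)
local gradOutput = torch.randn(2)
local lr = 0.01

-- standard path: gradients accumulate into gradWeight/gradBias buffers,
-- roughly doubling parameter memory
model:zeroGradParameters()
model:forward(input)
model:backward(input, gradOutput)     -- fills the grad buffers
model:updateParameters(lr)

-- accUpdate path: the update is folded into the backward pass, so the
-- grad buffers are never filled (and in principle needn't be allocated)
model:forward(input)
model:backwardUpdate(input, gradOutput, lr)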
cjviper
@cjviper
I think my problem was different. Something to do with the sampling method used for the validation set - it uses crops of the main image (i.e. TL, TR, BL, BR + center) which increases the data set size.
https://github.com/nicholas-leonard/dp/blob/master/examples/convolutionneuralnetwork.lua#L213
I changed the above line to reduce the batch size in the validator by a factor of 10 and it solved my problem.
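From memory, the change amounted to something like this (a sketch, not the exact diff; names follow the dp example):

-- shrink the evaluator's batches so the 5-crop expansion fits in memory
valid = dp.Evaluator{
   feedback = dp.Confusion(),
   sampler = dp.Sampler{batch_size = math.floor(opt.batchSize / 10)}
}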
Justin Payan
@justinpayan
Could anyone suggest tutorials or resources for learning how to implement tree-structured recursive neural networks (i.e., not flat recurrent neural networks) in dp and torch? Thank you!
elbamos
@elbamos
Justin Payan
@justinpayan
@elbamos Wow, thank you!
Nilesh Kulkarni
@nileshkulkarni

Hey,
I am trying to run this example recurrentlanguagemodel.lua using the penn tree dataset
When I train it with the cuda flag set to false, it runs perfectly fine.
But when I run it with cuda, it gets a segmentation fault.

Following is my log
http://pastebin.com/PH9R0QMF

Any debugging help would be great. How should I go about solving this?

Thanks,
Nilesh

arunpatala
@arunpatala
Hi, is there a way to do data augmentation (such as rotation, crop, etc.) with dpnn? Any example code would also be helpful. Thanks
elbamos
@elbamos
@arunpatala it's absolutely possible, I do it.
arunpatala
@arunpatala
Any pointers on how to approach that? @elbamos
elbamos
@elbamos
Be careful with allocating your buffers if you plan to multithread.
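A rough sketch of the on-the-fly approach with the torch image package (the ranges and sizes here are made up; if you multithread, preallocate output buffers per thread rather than allocating in the inner loop):

require 'image'

-- hypothetical per-sample augmentation, applied as batches are assembled
local function augment(src)
   local out = src
   if math.random() < 0.5 then
      out = image.hflip(out)                             -- random horizontal flip
   end
   out = image.rotate(out, (math.random() - 0.5) * 0.2)  -- small random rotation
   local size = out:size(3) - 8                          -- crop with 8px jitter
   local x = math.random(0, 8)
   local y = math.random(0, 8)
   return image.crop(out, x, y, x + size, y + size)
end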
cjviper
@cjviper
@arunpatala I perform data augmentation explicitly in advance, using GraphicsMagick commands. I wrote simple shell scripts that perform the initial resizing along with the crops/rotations etc.
Sanuj Sharma
@sanuj
@elbamos do you know how to change when the model is saved during training? Currently it happens when the validation error reaches a minimum.
elbamos
@elbamos
Yes, I do. And no, it doesn't.
Sanuj Sharma
@sanuj
@elbamos can you point me to the code that controls the saving of the model? What's the default criterion for saving the model, and how do I find out what it is?
elbamos
@elbamos
@sanuj Read the documentation for the Observer class and for the ErrorMinima class and its superclasses and subclasses.
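The wiring in the dp examples looks roughly like this (field names from memory, so double-check them against the docs); the EarlyStopper observer, which works with the ErrorMinima machinery, is what triggers the save when a new best score comes in:

xp = dp.Experiment{
   model = model,
   optimizer = train,
   validator = valid,
   observer = {
      dp.FileLogger(),
      dp.EarlyStopper{
         error_report = {'validator', 'feedback', 'confusion', 'accuracy'},
         maximize = true,   -- treat accuracy as a score to maximize
         max_epochs = 30    -- give up after 30 epochs with no improvement
      }
   }
}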
Sanuj Sharma
@sanuj
Thanks @elbamos. That was helpful.
Sanuj Sharma
@sanuj
Due to limited RAM I have multiple datasets for training. I want to change the dataset after each epoch, but the Lua garbage collector is not able to clear the old dataset to make way for the new one.
The code looks like:
-- note: these are deliberately globals (no `local`), so that `ds` is visible
-- to the loop below; they are nil'd first so the old dataset can be
-- garbage collected before the new one is loaded
function loadData(train_data, validate_data)
    ds = nil
    train = nil
    valid = nil
    train_target = nil
    valid_target = nil
    train_input = nil
    valid_input = nil
    n_valid = nil
    n_train = nil
    nuclei_train = nil
    nuclei_valid = nil

    -- 1. load tensors from disk
    nuclei_train = torch.load(train_data)
    nuclei_valid = torch.load(validate_data)
    nuclei_train.data = nuclei_train.data:double()
    nuclei_valid.data = nuclei_valid.data:double()
    n_valid = (#nuclei_valid.label)[1]
    n_train = (#nuclei_train.label)[1]

    -- 2. wrap tensors into views
    train_input = dp.ImageView('bchw', nuclei_train.data:narrow(1, 1, n_train))
    train_target = dp.ClassView('b', nuclei_train.label:narrow(1, 1, n_train))
    valid_input = dp.ImageView('bchw', nuclei_valid.data:narrow(1, 1, n_valid))
    valid_target = dp.ClassView('b', nuclei_valid.label:narrow(1, 1, n_valid))

    train_target:setClasses({0, 1, 2})
    valid_target:setClasses({0, 1, 2})

    -- 3. wrap views into datasets
    train = dp.DataSet{inputs=train_input, targets=train_target, which_set='train'}
    valid = dp.DataSet{inputs=valid_input, targets=valid_target, which_set='valid'}

    -- 4. wrap datasets into datasource
    ds = dp.DataSource{train_set=train, valid_set=valid}
    ds:classes{0, 1, 2}
end

while true do
    train_data = '/home/sanuj/Projects/nuclei-net-data/fine-tune/1/train.t7'
    validate_data = '/home/sanuj/Projects/nuclei-net-data/fine-tune/1/validate.t7'
    loadData(train_data, validate_data)
    print 'Using data-set 1.'
    xp:run(ds)
    train_data = '/home/sanuj/Projects/nuclei-net-data/fine-tune/2/train.t7'
    validate_data = '/home/sanuj/Projects/nuclei-net-data/fine-tune/2/validate.t7'
    loadData(train_data, validate_data)
    print 'Using data-set 2.'
    xp:run(ds)
end
Sanuj Sharma
@sanuj
It is able to clear ds if I don't call xp:run(ds) between two loadData calls, but I guess xp:run(ds) adds more references to ds, which stops the garbage collector from clearing it. I don't know how to fix it. @elbamos can you help me with this?
elbamos
@elbamos
Instead of changing datasets after each epoch, what you want to do is write a subclass of DataSet that produces the data you want after each epoch. Handling the memory considerations of moving training data in and out of RAM is a responsibility of the DataSet and DataSource objects.
Sanuj Sharma
@sanuj
@elbamos how will the subclass change the dataset after every epoch? does it need to subscribe to some event to get notified?
elbamos
@elbamos
are you asking me how it will know when one epoch ends and another begins?
It could subscribe if you wanted to do it that way, but there's an easier way. Depending on the Sampler you use, the system decides which rows to ask for, and in what order. So you tell the dataset to tell dp there are however many rows there actually are, and once that number of rows has been processed, it won't ask for any more and the epoch will be over. If different parts of your dataset have different numbers of rows, the simplest thing is to ignore what is one epoch and what is another: you produce batches in whatever order you want them processed, and when one source runs out you start taking batches from someplace else. Then the length of an epoch is just however often you want to see feedback reports.
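As a very rough shape of it (every name here is from memory or assumed; check dp's dataset and sampler source before trusting any of them):

-- hypothetical chunk-swapping dataset; which methods a Sampler actually
-- calls (sub, nSample, ...) depends on the Sampler, so match yours
local ChunkedDataSet, parent = torch.class("dp.ChunkedDataSet", "dp.DataSet")

function ChunkedDataSet:__init(config)
   self._paths = config.paths   -- assumed field: list of .t7 chunk files
   self._chunk = 1
   parent.__init(self, config)
end

function ChunkedDataSet:loadNextChunk()
   self._chunk = self._chunk % #self._paths + 1
   local raw = torch.load(self._paths[self._chunk])
   -- rebuild the views in place; these field names are assumptions
   self._inputs = dp.ImageView('bchw', raw.data:double())
   self._targets = dp.ClassView('b', raw.label)
   collectgarbage()             -- the previous chunk is unreferenced now
end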
Sanuj Sharma
@sanuj
I think I would have to understand the internals of dp
elbamos
@elbamos
No, just study the way imagedataset etc. are implemented
Sanuj Sharma
@sanuj
Thanks @elbamos :smile:, I'll try this tomorrow. I hope it works.
elbamos
@elbamos
@sanuj it definitely works. There are examples included with dp specifically to show how to handle when you have a dataset that's too large to fit in memory.
Jacky Yang
@anguoyang

Hi all,
Could anyone kindly help me with this issue? Thanks:

We have lots of photos/images, say 10 million or more. They are original photos/images from our customers which need to be protected (to prevent plagiarism); we call this dataset A.
We also have lots of images gathered by a web crawler from blogs, websites, forums, etc. Some of these images are simply copied from dataset A, and some have an additional watermark added; we call this dataset B. It currently contains about 300,000 images, but it will grow day by day.
We will take one image or several images from dataset A (we call this dataset C), and we want to search for images in B which are similar to those in C, and list all the similar images.

We want to use deep learning for the similarity search, but most of the images in dataset A have no tags. Could we train a specific model on these images, so that we get more accurate results when searching for similar images?

Thanks a lot for your patience in reading this long requirement, and have a nice day!

elbamos
@elbamos
@anguoyang that's very similar to work I've done - you can pm me
Sanuj Sharma
@sanuj
Hey @elbamos, I was trying to do transfer learning with dp. I want different learning rates for each layer in my CNN. How can I do that? Here is the script that I'm using.
elbamos
@elbamos
@sanuj In dp, when you create your dp.Optimizer object, you define a function called callback. The callback function is executed after every batch of the training set, and it performs the actual parameter updates. Your script uses the simple callback from one of the dp examples. If you want a learning pattern other than simple SGD, like adding momentum, norms, cutoffs, etc., you do it in dp by modifying the callback function.
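For per-layer rates, one sketch (the layer indices and rates here are made up; updateGradParameters is dpnn's momentum helper, and model is assumed to be an nn.Sequential):

local layerRates = {[1] = 1e-3, [4] = 1e-2, [7] = 1e-1}  -- hypothetical
local baseRate = 1e-2

callback = function(model, report)
   model:updateGradParameters(0.9)   -- dpnn: apply momentum to the grad params
   for i, layer in ipairs(model.modules) do
      -- plain SGD step, but with the rate chosen per child module
      layer:updateParameters(layerRates[i] or baseRate)
   end
   model:zeroGradParameters()
end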
@sanuj Can we assume that you were able to resolve the large-data issue you were having a few weeks ago?
Sanuj Sharma
@sanuj
@elbamos thanks for your reply. I had used ImageSource, which allows reading each batch from the hard drive, but it was slow since I don't have an SSD. I couldn't make it read data for each epoch instead of each batch, but I don't need it anymore, so I didn't try further.
Jay Devassy
@jaydevassy
I have a trained convnet for object identification. I need to "run" it on a larger image to locate the target object in that image. How do I leverage the built-in convolution operation/module in torch to do this? I'm not worried about scale or rotation invariance at this point. Basically I'm trying to avoid a sliding window approach over the larger image, which would be inefficient (most of the computations would be thrown away at the next window position). Any ideas or pointers? Thx
elbamos
@elbamos
@jaydevassy train a conv net as a pixel classifier or build an attention model
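(One common trick for exactly this, not necessarily what @elbamos has in mind above, is to make the trained net fully convolutional: an nn.Linear that consumes a flattened h×w feature map is equivalent to an nn.SpatialConvolution with an h×w kernel, so the whole large image goes through in one pass. A toy sketch with made-up sizes:)

require 'nn'

-- toy numbers: a trained head nn.Linear(c*8*8, k) over 8x8 feature maps
local c, k = 64, 10
local linear = nn.Linear(c * 8 * 8, k)   -- stands in for the trained head

-- equivalent convolutional head: same weights viewed as (k, c, 8, 8);
-- this copy assumes the original flatten order was the standard (c, h, w)
local conv = nn.SpatialConvolution(c, k, 8, 8)
conv.weight:copy(linear.weight)
conv.bias:copy(linear.bias)

-- on a larger feature map the conv head emits a grid of class scores,
-- one per 8x8 window, sharing all convolutional work between windows
local scores = conv:forward(torch.randn(c, 20, 20))   -- k x 13 x 13 score map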