    Amir
    @amir-abdi
    @spMohanty Do the local_evaluation.py and evaluate.py files need to be in the root directory (same place where run.sh is)?
    SP Mohanty
    @spMohanty
    @amir-abdi : No, local_evaluation.py and evaluate.py can be anywhere. We call a separate version of the file, so you do not even need to include them in the repo, or invoke them via run.sh.
    Amir
    @amir-abdi
    @spMohanty the best submission is not the one listed on the leaderboard.
    Loris Michel
    @lorismichel
    Dear @spMohanty, 4 submissions of mine using PyTorch failed today.
    I suppose it is always the same error, but I cannot spot it; could you kindly provide me with the error log? One of these submissions is http://gitlab.aicrowd.com/loris_michel/neurips2019_disentanglement_challenge_starter_kit/issues/20
    SP Mohanty
    @spMohanty
    Do you have the aicrowd_helpers.submit() call?
    Loris Michel
    @lorismichel
    yes, at the end of the file
    SP Mohanty
    @spMohanty
    Pasted the error, and also pasting here as it might help others:
    2019-07-24T15:53:43.380486148Z Training Start...
    2019-07-24T15:53:43.382430252Z Training Progress : 0.0
    2019-07-24T15:53:43.704665965Z ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
    2019-07-24T15:53:47.100589389Z Train Epoch: 1 [0/460800 (0%)]    Loss_final: 5.233527, Loss: 0.076500, BCE: 0.076500, KLD: 0.000000, Loss_pre: 10.390554, BCE_pre: 0.055477, KLD_pre: 0.000000, PL_pre: 10.335077
    2019-07-24T15:54:59.28790464Z
    Loris Michel
    @lorismichel
    Do you have an idea what it could be? Too big a batch size? I use a batch size of 512.
    SP Mohanty
    @spMohanty
    It looks like it's because of RAM on the nodes.
    Loris Michel
    @lorismichel
    Do you have a fix to suggest? Decreasing the batch size, for example, or do you think it is a deeper issue?
    SP Mohanty
    @spMohanty
    Does your code need excessive amounts of memory?
    Loris Michel
    @lorismichel
    I am reading pretrained weights from a 49MB file, then just training a model, and that's it.
    Loris Michel
    @lorismichel
    Thanks, so essentially this means setting num_workers to 0 instead of 1 in the pytorch/train_pytorch.py file, if I get it right?
    SP Mohanty
    @spMohanty
    Yeah
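    For reference, a minimal sketch of the change being discussed, assuming a typical PyTorch training setup (the dataset below is a random stand-in, not the actual starter-kit data):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in dataset; the real starter kit loads the challenge data instead.
    dataset = TensorDataset(torch.randn(1024, 3, 64, 64))

    # num_workers > 0 makes the DataLoader spawn worker processes that pass
    # batches through shared memory (/dev/shm); on nodes with a small shm
    # allocation this can raise "Unexpected bus error ... shared memory (shm)".
    # num_workers=0 loads batches in the main process and avoids shm entirely.
    loader = DataLoader(dataset, batch_size=512, shuffle=True, num_workers=0)

    for (batch,) in loader:
        pass  # training step would go here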
    Loris Michel
    @lorismichel
    Dear @spMohanty, could you kindly increase my submission limit? I just noticed that the base pytorch/train_pytorch.py file you provide also fails, so I would like to try the fix you suggested, but I have no submissions left.
    (This could also explain why I have memory issues on my GPU when I try to evaluate the FactorVAE score locally.)
    mseitzer
    @mseitzer
    @spMohanty Regarding disentanglement_lib v1.2: would it be possible to install directly from the GitHub repository as a quick fix, instead of waiting for the PyPI package? I.e. put something like "git+git://github.com/google-research/disentanglement_lib.git@v1.2#[tf_gpu]" in the requirements.txt.
    This would be of great help, as I have not been able to get a single model to evaluate successfully. And I'm sure I am not the only one with this problem (e.g. @lorismichel has it as well).
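    For anyone trying this locally before the fix lands, pip's VCS syntax in requirements.txt would look roughly like the sketch below (https is used instead of the git:// protocol; the tag name and the [tf_gpu] extra come from the message above, and the egg-fragment spelling is an assumption about pip's syntax, not a verified line):

    # requirements.txt -- install disentanglement_lib from GitHub at tag v1.2
    git+https://github.com/google-research/disentanglement_lib.git@v1.2#egg=disentanglement_lib[tf_gpu]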
    SP Mohanty
    @spMohanty
    @mseitzer : Will do !
    Amir
    @amir-abdi
    @spMohanty the total execution time does not take into account the time spent waiting in the queue for evaluation to start. Consequently, I believe my latest submission went over the time limit.
    @spMohanty is it possible for you to check this and compensate for the wasted time if that was the case?
    SP Mohanty
    @spMohanty
    @amir-abdi : Will do !
    and @mseitzer : v1.2 is being used on the evaluator now.
    Loris Michel
    @lorismichel
    Due to the versioning of the libs, as pointed out by @mseitzer?
    Thanks a lot for your help in advance.
    Loris Michel
    @lorismichel

    CUDA out of memory. Tried to allocate 6.22 GiB (GPU 0; 11.17 GiB total capacity; 6.70 GiB already allocated; 4.17 GiB free; 4.19 MiB cached) (malloc at /opt/conda/conda-bld/pytorch_1556653099582/work/c10/cuda/CUDACachingAllocator.cpp:267)
    frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f13a088fdc5 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libc10.so)
    frame #1: <unknown function> + 0x16ca7 (0x7f13a044dca7 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
    frame #2: <unknown function> + 0x17347 (0x7f13a044e347 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
    frame #3: THCStorage_resize + 0x96 (0x7f137180e706 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
    frame #4: at::native::empty_strided_cuda(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions const&) + 0x4f1 (0x7f1372fa5851 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
    frame #5: at::CUDAType::empty_strided(c10::ArrayRef<long>, c10::ArrayRef<long>, c10::TensorOptions const&) const + 0x1b4 (0x7f13716ba244 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
    frame #6: at::TensorIterator::allocate_outputs() + 0x526 (0x7f136db06f66 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
    frame #7: at::TensorIterator::Builder::build() + 0x48 (0x7f136db075e8 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
    frame #8: at::TensorIterator::binary_op(at::Tensor&, at::Tensor const&, at::Tensor const&) + 0x31f (0x7f136db0843f in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
    frame #9: <unknown function> + 0x629d09 (0x7f136d966d09 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
    frame #10: at::native::threshold(at::Tensor const&, c10::Scalar, c10::Scalar) + 0x3d (0x7f136d96757d in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
    frame #11: at::TypeDefault::threshold(at::Tensor const&, c10::Scalar, c10::Scalar) const + 0x6d (0x7f136ddc87ad in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
    frame #12: at::native::relu(at::Tensor const&) + 0x5f (0x7f136d96577f in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
    frame #13: at::CUDAType::relu(at::Tensor const&) const + 0xc2 (0x7f1371756212 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
    frame #14: torch::autograd::VariableType::relu(at::Tensor const&) const + 0x479 (0x7f1365e06619 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
    frame #15: <unknown function> + 0xa1d09d (0x7f13663cf09d in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
    frame #16: <unknown function> + 0xa73df8 (0x7f1366425df8 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
    frame #17: torch::jit::InterpreterState::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) + 0x22 (0x7f1366421372 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
    frame #18: <unknown function> + 0xa5b2d9 (0x7f136640d2d9 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
    frame #19: <unknown function> + 0x457f18 (0x7f13a0f00f18 in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
    frame #20: <unknown function> + 0x12ce4a (0x7f13a0bd5e4a in /srv/conda/envs/notebook/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

    <omitting python frames>
    :
    operation failed in interpreter:
    bias = _11.bias
    _12 = _0.head_mu
    weight0 = _12.weight
    bias0 = _12.bias
    input0 = torch._convolution(input, _3, _4, [2, 2], [1, 1], [1, 1], False, [0, 0], 1, False, False, True)
    input1 = torch.relu(input0)
    input2 = torch._convolution(input1, _6, _7, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, False, False, True)
    input3 = torch.relu(input2)
    input4 = torch._convol

    Here is the error message I get when trying to evaluate the FactorVAE metric. It looks like too much memory is used, but the model is not overly big...
    SP Mohanty
    @spMohanty
    @lorismichel : This seems to be because of the large batch size used in the FactorVAE evaluation metric. If you look at the v1.2 release (on GitHub: google-research/disentanglement_lib), the new release addresses this problem.
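    Until the upgrade, a general workaround for this kind of eval-time OOM is to compute representations in smaller chunks instead of one huge batch; a minimal sketch (represent_fn and the inputs tensor are hypothetical stand-ins, not the disentanglement_lib API):

    import torch

    def batched_representations(represent_fn, inputs, chunk_size=64):
        """Apply represent_fn over inputs in small chunks so the whole
        batch never has to sit on the GPU at once."""
        outputs = []
        with torch.no_grad():  # evaluation only: skip autograd buffers
            for chunk in torch.split(inputs, chunk_size):
                outputs.append(represent_fn(chunk).cpu())  # move results off-GPU
        return torch.cat(outputs)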
    Loris Michel
    @lorismichel
    thanks @spMohanty
    mseitzer
    @mseitzer
    @spMohanty Could you check what happened at https://gitlab.aicrowd.com/mseitzer/disentanglement-challenge/issues/9 please? I get "HTTPSConnectionPool(host='gitlab.aicrowd.com', port=443): Read timed out."
    Insu Jeon
    @InsuJeon
    Hi, @spMohanty. My model submission has had the "waiting_in_queue_for_evaluation" label for almost 16 hours. The model training stage finished long ago, but the waiting time for evaluation seems a bit too long. Is this normal and OK? https://gitlab.aicrowd.com/isjeon/neurips2019_disentanglement_challenge_starter_kit/issues/3
    SP Mohanty
    @spMohanty
    @InsuJeon : The queue is clogged because of the many submissions, and your submission hasn't started being evaluated yet.
    We are increasing capacity soon.
    Insu Jeon
    @InsuJeon
    @spMohanty Thank you! :)
    Sourabh Balgi
    @sobalgi
    @spMohanty My submission shows training in progress and then, after some time, failed: https://gitlab.aicrowd.com/sourabh_balgi/neurips2019_disentanglement_challenge_starter_kit/issues/5
    @spMohanty Any logs available for debugging?
    Amir
    @amir-abdi
    @spMohanty any feedback on why this failed? https://gitlab.aicrowd.com/amirabdi/disentanglement/issues/65
    And the evaluation of the following issue never started; yet, I was told that this is an overtime problem, which doesn't sound right. Please double-check. Thanks.
    ShabnamGh
    @ShabnamGh
    @spMohanty would you please help with this submission: https://gitlab.aicrowd.com/Shab7nam/neurips2019_disentanglement_challenge_shabnam/issues/5. Training is in progress, but there is a failed message for evaluation!