    SP Mohanty
    @spMohanty
    @amir-abdi : the training can take up to 8 hours
    and yes, you can use pretrained weights to initiate the training.
    Please use git-lfs to push large model binaries to your repository
    mseitzer
    @mseitzer
    @spMohanty Can you give me pointers on what went wrong at gitlab.aicrowd.com/mseitzer/disentanglement-challenge/issues/1? Thanks!
    SP Mohanty
    @spMohanty
    @mseitzer : Your submission's code does not have access to the network, so if your code needs any pretrained models, please include them in the repository (via git-lfs)
    2019-07-20T15:10:32.402368823Z Downloading: "https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pth" to /home/aicrowd/.cache/torch/checkpoints/resnext101_32x8d-8ba56ff5.pth
    2019-07-20T15:10:32.40246759Z Running on cuda
    2019-07-20T15:10:32.402537816Z Training with 460800 images
    2019-07-20T15:10:32.402542748Z Training Start...
    2019-07-20T15:10:32.40254623Z Training Progress : 0.0
    2019-07-20T15:10:32.407730814Z Traceback (most recent call last):
    
    ......
    ......
    
    2019-07-20T15:10:32.407835462Z   File "/srv/conda/envs/notebook/lib/python3.6/socket.py", line 713, in create_connection
    2019-07-20T15:10:32.407838736Z     sock.connect(sa)
    2019-07-20T15:10:32.407841995Z OSError: [Errno 101] Network is unreachable
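    The offline-loading pattern Mohanty describes can be sketched roughly as below, assuming PyTorch is available. The tiny nn.Linear model and the weights.pth path are placeholders for a submission's real architecture and a checkpoint committed to the repository via git-lfs:

    ```python
    import torch
    import torch.nn as nn

    # Tiny stand-in model; a real submission would use its actual architecture
    model = nn.Linear(4, 2)

    # Save the weights once, locally; committing this file via git-lfs
    # makes it available inside the network-less evaluation container
    torch.save(model.state_dict(), "weights.pth")

    # At evaluation time: load from the repo-local path instead of letting
    # torchvision/torch.hub download from download.pytorch.org, which
    # fails with "Network is unreachable" inside the sandbox
    fresh = nn.Linear(4, 2)
    fresh.load_state_dict(torch.load("weights.pth", map_location="cpu"))
    ```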
    mseitzer
    @mseitzer
    @spMohanty Thanks. I included the pretrained model, but it seems that there are further problems. Could you take another look please?
    Amir
    @amir-abdi
    @spMohanty I'm waiting for some logs to understand why my training fails after 2 hours. Thanks. https://gitlab.aicrowd.com/amirabdi/disentanglement/issues/22
    Amir
    @amir-abdi
    @spMohanty I'm still failing during evaluation and not sure whether it's my implementation or something on your end. Thanks
    SP Mohanty
    @spMohanty
    Will be at my desk in an hour and then investigate the cause of failure
    Also, a general note: you don't have to include a call to local_evaluation.py at your end.
    We will have a separate container evaluate the dumped representations.
    Amir
    @amir-abdi
    @spMohanty thank you. But mine still failed at 99%
    mseitzer
    @mseitzer
    @spMohanty Could you give me access to some logs so I can understand what's going wrong? I really would like to submit something https://gitlab.aicrowd.com/mseitzer/disentanglement-challenge/issues/3
    Also, could you update disentanglement-lib to version 1.2 in the evaluator (if that is not done automatically)? They fixed some bugs that prevented my model from being evaluated properly.
    SP Mohanty
    @spMohanty
    @mseitzer : I pasted the logs on the issue, but your code is trying to access the external network (to download pretrained models), which is not allowed as a security measure. If you want to use pretrained models, you will have to include them in your repository (via git-lfs)
    and I am waiting for Olivier to push disentanglement-lib v1.2 to PyPI before updating the evaluator
    I will check in with him soon
    and keep you folks updated here
    Amir
    @amir-abdi
    @spMohanty Any updates on why the models are all stuck on 99%?
    SP Mohanty
    @spMohanty
    @amir-abdi : From the looks of it, it seems to be a timeout (without a proper error message), and also a lack of availability of convert for some reason.
    I agree the error propagation could be much better, but we will have a better solution for that from next week.
    Amir
    @amir-abdi
    @spMohanty Do the local_evaluation.py and evaluate.py files need to be in the root directory (same place where run.sh is)?
    SP Mohanty
    @spMohanty
    @amir-abdi : No, local_evaluation.py and evaluate.py can be anywhere. We call a separate version of the file, so you do not even need to include them in the repo, or invoke them via run.sh
    Amir
    @amir-abdi
    @spMohanty the best submission is not the one listed on the leaderboard.
    Loris Michel
    @lorismichel
    dear @spMohanty, 4 submissions of mine using PyTorch failed today
    I suppose it is always the same error but I cannot spot it; could you kindly provide me with the error log? One of these submissions is http://gitlab.aicrowd.com/loris_michel/neurips2019_disentanglement_challenge_starter_kit/issues/20
    SP Mohanty
    @spMohanty
    Do you have the aicrowd_helpers.submit() call ?
    Loris Michel
    @lorismichel
    yes, at the end of the file
    SP Mohanty
    @spMohanty
    Pasted the error, and also pasting here as it might help others:
    2019-07-24T15:53:43.380486148Z Training Start...
    2019-07-24T15:53:43.382430252Z Training Progress : 0.0
    2019-07-24T15:53:43.704665965Z ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
    2019-07-24T15:53:47.100589389Z Train Epoch: 1 [0/460800 (0%)]    Loss_final: 5.233527, Loss: 0.076500, BCE: 0.076500, KLD: 0.000000, Loss_pre: 10.390554, BCE_pre: 0.055477, KLD_pre: 0.000000, PL_pre: 10.335077
    2019-07-24T15:54:59.28790464Z
    Loris Michel
    @lorismichel
    do you have an idea what it could be? too big a batch size? I use a batch size of 512
    SP Mohanty
    @spMohanty
    It looks like it's because of RAM on the nodes
    Loris Michel
    @lorismichel
    do you have a fix to suggest? Decreasing the batch size, for example, or do you think it is a deeper issue?
    SP Mohanty
    @spMohanty
    Does your code need excessive amounts of memory ?
    Loris Michel
    @lorismichel
    I am reading pretrained weights from a 49MB file, then just training a model, and that's it
    Loris Michel
    @lorismichel
    thanks, so essentially this means setting num_workers to 0 instead of 1 in the pytorch/train_pytorch.py file, if I get it right
    SP Mohanty
    @spMohanty
    Yeah
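    The fix agreed on above can be sketched roughly as follows; the TensorDataset here is a dummy stand-in for the starter kit's actual data loading:

    ```python
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Dummy tensors standing in for the challenge images
    data = TensorDataset(torch.randn(64, 3, 8, 8), torch.zeros(64))

    # num_workers=0 loads batches in the main process: no worker
    # subprocesses, hence no dependence on the container's /dev/shm size,
    # which is what triggers the "insufficient shared memory (shm)" bus error
    loader = DataLoader(data, batch_size=16, num_workers=0)

    for images, labels in loader:
        pass  # training step goes here
    ```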
    Loris Michel
    @lorismichel
    dear @spMohanty could you kindly increase my submission limit? I just noticed that with the base file pytorch/train_pytorch.py you provide, it also fails. So I would like to try the possible fix you suggested, but I have no submissions left.
    (this could also explain why I have memory issues on my GPU when I try to evaluate the FactorVAE score locally)
    mseitzer
    @mseitzer
    @spMohanty Regarding disentanglement-lib v1.2: would it be possible to install directly from the GitHub repository as a quick fix, instead of waiting for the PyPI package? I.e. put something like "git+git://github.com/google-research/disentanglement_lib.git@v1.2#[tf_gpu]" in the requirements.txt.
    This would be of great help, as I was not able to get a single model to evaluate successfully. And I'm sure I am not the only one with this problem (e.g. @lorismichel has it as well)
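    As a hedged aside: pip can install a package straight from a Git tag, though the snippet above uses the deprecated git:// protocol and omits the distribution name, which newer pip versions may reject. A PEP 508 direct reference along these lines (the package name disentanglement-lib, the v1.2 tag, and the tf_gpu extra are assumptions taken from the messages above) would be the modern form in requirements.txt:

    ```
    disentanglement-lib[tf_gpu] @ git+https://github.com/google-research/disentanglement_lib.git@v1.2
    ```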
    SP Mohanty
    @spMohanty
    @mseitzer : Will do !
    Amir
    @amir-abdi
    @spMohanty the total execution time does not take into account the time waited in the queue for evaluation to start. Consequently, I believe my latest submission went over the time limit.
    @spMohanty is it possible for you to check this and compensate for the wasted time if that was the case?
    SP Mohanty
    @spMohanty
    @amir-abdi : Will do !
    and @mseitzer : v1.2 is being used on the evaluator now.