    Amir
    @amir-abdi
    @spMohanty How long can the training take? Any limits there?
    Amir
    @amir-abdi
    @spMohanty Will only the "best" submission be considered for final ranking, or the "last" submission?
    @spMohanty Is it OK to initiate training from a preTrained set of weights?
    SP Mohanty
    @spMohanty
    @amir-abdi : In the case of this challenge, we are indeed considering the "best" submission for the final ranking.
    The last submission thing was quite specific to the Unity challenge, and we doubt it will ever be repeated again.
    Also, please note that the final ranking is computed by individually ranking a participant's/team's submissions across all the metrics, and then taking the mean of those ranks.
    (this is not live yet!)
    SP Mohanty
    @spMohanty
    @amir-abdi : the training can take up to 8 hours
    and yes, you can use pretrained weights to initiate the training.
    Please use git-lfs to push large model binaries to your repository
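    As a rough sketch (the module and the checkpoint path below are placeholders, not part of the starter kit), initialising from repo-local weights could look like this:
    import torch
    import torch.nn as nn

    # Illustrative stand-in encoder; replace with your actual model.
    encoder = nn.Sequential(nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(), nn.Flatten())
    # Load weights committed to the repository (pushed via git-lfs), so no network access is needed.
    state_dict = torch.load("pretrained/encoder_init.pth", map_location="cpu")
    encoder.load_state_dict(state_dict)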
    mseitzer
    @mseitzer
    @spMohanty Can you give me pointers on what went wrong at gitlab.aicrowd.com/mseitzer/disentanglement-challenge/issues/1? Thanks!
    SP Mohanty
    @spMohanty
    @mseitzer : Your submission's code does not have access to the network, so if your code needs any pretrained models, please include them in the repository (over git-lfs)
    2019-07-20T15:10:32.402368823Z Downloading: "https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pth" to /home/aicrowd/.cache/torch/checkpoints/resnext101_32x8d-8ba56ff5.pth
    2019-07-20T15:10:32.40246759Z Running on cuda
    2019-07-20T15:10:32.402537816Z Training with 460800 images
    2019-07-20T15:10:32.402542748Z Training Start...
    2019-07-20T15:10:32.40254623Z Training Progress : 0.0
    2019-07-20T15:10:32.407730814Z Traceback (most recent call last):
    
    ......
    ......
    
    2019-07-20T15:10:32.407835462Z   File "/srv/conda/envs/notebook/lib/python3.6/socket.py", line 713, in create_connection
    2019-07-20T15:10:32.407838736Z     sock.connect(sa)
    2019-07-20T15:10:32.407841995Z OSError: [Errno 101] Network is unreachable
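    For the resnext download in that log, one way to avoid hitting the network (assuming torchvision) is to build the model with pretrained=False and load a checkpoint that is committed to your repository, e.g.:
    import torch
    import torchvision.models as models

    # Build the architecture without downloading weights, then load a state dict
    # shipped inside the repository (tracked with git-lfs).
    model = models.resnext101_32x8d(pretrained=False)
    state_dict = torch.load("checkpoints/resnext101_32x8d-8ba56ff5.pth", map_location="cpu")
    model.load_state_dict(state_dict)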
    mseitzer
    @mseitzer
    @spMohanty Thanks. I included the pretrained model, but it seems that there are further problems. Could you take another look please?
    Amir
    @amir-abdi
    @spMohanty I'm waiting for some logs to understand why my training fails after 2 hours. Thanks. https://gitlab.aicrowd.com/amirabdi/disentanglement/issues/22
    Amir
    @amir-abdi
    @spMohanty I'm still failing during evaluation and not sure whether it's my implementation or something on your end. Thanks
    SP Mohanty
    @spMohanty
    Will be at my desk in an hour and then investigate the cause of failure
    Also, a general note that you don't have to include a call to local_evaluation.py at your end.
    We will have a separate container evaluate the dumped representations.
    Amir
    @amir-abdi
    @spMohanty thank you. But mine still failed at 99%
    mseitzer
    @mseitzer
    @spMohanty Could you give me access to some logs so I can understand what's going wrong? I really would like to submit something https://gitlab.aicrowd.com/mseitzer/disentanglement-challenge/issues/3
    Also, could you update disentanglement-lib to version 1.2 in the evaluator (if that is not done automatically)? They fixed some bugs that prevented my model from being evaluated properly.
    SP Mohanty
    @spMohanty
    @mseitzer : I pasted the logs on the issue, but your code is trying to access the external network (to download pretrained models), which is not allowed as a security measure. If you want to use pre-trained models, you will have to include them in your repository (via git-lfs)
    and I am waiting for Olivier to push disentanglement-lib v1.2 to PyPI before updating the evaluator
    I will check in with him soon
    and keep you folks updated here
    Amir
    @amir-abdi
    @spMohanty Any updates on why the models are all stuck on 99%?
    SP Mohanty
    @spMohanty
    @amir-abdi : From the look of it, it seems to be a timeout (without a proper error message), and also the convert binary being unavailable for some reason.
    I agree the error propagation could be much better, but we will have a better solution for that from next week.
    Amir
    @amir-abdi
    @spMohanty Do the local_evaluation.py and evaluate.py files need to be in the root directory (the same place as run.sh)?
    SP Mohanty
    @spMohanty
    @amir-abdi : No, local_evaluation.py and evaluate.py can be anywhere. We call a separate version of the file, so you do not even need to include them in the repo, or invoke them via run.sh
    Amir
    @amir-abdi
    @spMohanty the best submission is not the one listed on the leaderboard.
    Loris Michel
    @lorismichel
    Dear @spMohanty, 4 submissions of mine using PyTorch failed today.
    I suppose it is always the same error but I cannot spot it. Could you kindly provide me with the error log? One of these submissions is http://gitlab.aicrowd.com/loris_michel/neurips2019_disentanglement_challenge_starter_kit/issues/20
    SP Mohanty
    @spMohanty
    Do you have the aicrowd_helpers.submit() call?
    Loris Michel
    @lorismichel
    yes, at the end of the file
    SP Mohanty
    @spMohanty
    Pasted the error, and also pasting here as it might help others:
    2019-07-24T15:53:43.380486148Z Training Start...
    2019-07-24T15:53:43.382430252Z Training Progress : 0.0
    2019-07-24T15:53:43.704665965Z ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
    2019-07-24T15:53:47.100589389Z Train Epoch: 1 [0/460800 (0%)]    Loss_final: 5.233527, Loss: 0.076500, BCE: 0.076500, KLD: 0.000000, Loss_pre: 10.390554, BCE_pre: 0.055477, KLD_pre: 0.000000, PL_pre: 10.335077
    2019-07-24T15:54:59.28790464Z
    Loris Michel
    @lorismichel
    do you have an idea what it could be? Too big a batch size? I use a batch size of 512
    SP Mohanty
    @spMohanty
    It looks like it's because of RAM on the nodes
    Loris Michel
    @lorismichel
    do you have a fix to suggest? Decreasing the batch size, for example, or do you think it is a deeper issue?
    SP Mohanty
    @spMohanty
    Does your code need excessive amounts of memory?
    Loris Michel
    @lorismichel
    I am reading pretrained weights from a 49MB file, then just training a model and that's it
    SP Mohanty
    @spMohanty
    Loris Michel
    @lorismichel
    Thanks, so essentially this means setting num_workers to 0 instead of 1 in the pytorch/train_pytorch.py file, if I get it right
    SP Mohanty
    @spMohanty
    Yeah
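    Roughly like this (the dataset below is just illustrative; in train_pytorch.py you would keep the challenge dataset):
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Illustrative dataset standing in for the challenge data.
    dataset = TensorDataset(torch.zeros(1024, 3, 64, 64))
    # num_workers=0 keeps data loading in the main process, so the shared-memory (shm)
    # workers are not used; a smaller batch_size also reduces peak memory.
    train_loader = DataLoader(dataset, batch_size=256, shuffle=True, num_workers=0)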
    Loris Michel
    @lorismichel
    dear @spMohanty, could you kindly increase my submission limit? I just noticed that the base file pytorch/train_pytorch.py you provide also fails, so I would like to try the fix you suggested, but I have no submissions left.
    (this could also explain why I have memory issues on my GPU when I try to evaluate the FactorVAE score locally)
    mseitzer
    @mseitzer
    @spMohanty Regarding disentanglement-lib v1.2: would it be possible to directly install from the GitHub repository as a quick fix, instead of waiting for the PyPI package? I.e. put something like "git+git://github.com/google-research/disentanglement_lib.git@v1.2#[tf_gpu]" in the requirements.txt.