    Shunta Saito
    @mitmul
    -sameroom portal
    Sameroom
    @sameroom-bot
    <Sameroom> Your Portal URL is https://sameroom.io/KkARyiPY -- you can send the URL to someone on a different team to share this room. Note: you can connect more than two teams this way.
    <Sameroom> I've connected 1 new room #multinode (chainer) on Slack. See map
    Shunta Saito
    @mitmul
    @ilkarman Hi, can you see this message?
    Regarding the test of ChainerMN on Azure Batch AI, let's discuss how to get reasonable results here.
    https://github.com/ilkarman/DeepLearningFrameworks/pull/85#issuecomment-393108492
    Ilia Karmanov
    @ilkarman
    @mitmul Thanks for the invite! Yes I can
    Ilia Karmanov
    @ilkarman
    The ResNet50 example I posted also matches the example we have for MNIST - https://github.com/Azure/BatchAI/blob/master/recipes/Chainer/Chainer-GPU-Distributed/train_mnist.py - apart from two things: 1) a custom data-generator class, 2) using a multi-process iterator. However, compared to the Chainer ResNet50 example, my script would crash with set_start_method('forkserver'), so I have omitted it.
    I'm not sure whether it's an environment issue (and I should talk with the BatchAI team) or just something in my training script: https://gist.github.com/ilkarman/37a4d5f44f25a4e023572a954a6b258f
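    For reference, a minimal sketch of where that call typically goes, assuming the pattern used in the ChainerMN ImageNet example (the forkserver warm-up ordering matters; the dataset and iterator arguments here are illustrative stand-ins):
    # Hedged sketch, not the author's actual script: set the start method and
    # warm up the forkserver BEFORE any CUDA/MPI initialization, so that
    # MultiprocessIterator worker processes don't inherit a CUDA context.
    import multiprocessing

    import numpy as np
    import chainer

    multiprocessing.set_start_method('forkserver')
    p = multiprocessing.Process()  # dummy process to spawn the forkserver early
    p.start()
    p.join()

    # ... communicator / model / device setup would happen here ...

    train_dataset = np.random.rand(100, 3, 224, 224).astype(np.float32)  # stand-in
    train_iter = chainer.iterators.MultiprocessIterator(
        train_dataset, batch_size=64, n_processes=4)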
    Ilia Karmanov
    @ilkarman
    Ah, I see that the Chainer example on Azure has processCount = nodeCount by default, so before I had (for 2 nodes and 8 GPUs):
    Num process (COMM_WORLD): 2
    Using hierarchical communicator
    Num Minibatch-size: 64
    Num epoch: 1
    and now:
    Num process (COMM_WORLD): 8
    Using hierarchical communicator
    Num Minibatch-size: 64
    Num epoch: 1
    And the estimated time for one epoch is around 5000 seconds, which is similar to CNTK but still a bit slower than Horovod/raw PyTorch.
    So fixing COMM_WORLD has improved the epoch time considerably, from 5.5 hrs to an expected 1.5 hrs (on 8 K80s, 2 nodes), but it's still a bit slower than the other frameworks in our examples.
    And this particular example runs with no image augmentation.
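    For context, ChainerMN generally expects one MPI process per GPU, so 2 nodes with 4 K80s each should launch 8 ranks, not 2. A hedged sketch of an equivalent manual launch (Open MPI-style flags, script name illustrative; Batch AI generates its own launch command):
    # 8 ranks total, 4 per node (one per GPU); placement flags vary by MPI implementation
    mpirun -np 8 --map-by ppr:4:node python train.py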
    Ilia Karmanov
    @ilkarman
    I can check with the team what MPI commands chainerSettings executes on Batch AI; however, the raw job is:
    "chainerSettings": {
        "additionalProperties": {},
        "commandLineArgs": null,
        "processCount": 8,
        "pythonInterpreterPath": null,
        "pythonScriptFilePath": "$AZ_BATCHAI_INPUT_SCRIPTS/ImagenetEstimatorChainer.py"
    },
    When we used PyTorch we got 4000 seconds of training, with augmentation, for one epoch of ResNet50 on 2 nodes and 8 K80s; at the moment Chainer is taking 6600 seconds.
    Ilia Karmanov
    @ilkarman
    In the end it took 7039 seconds for one epoch with Chainer.
    With PyTorch we were getting around 4000.
    Sameroom
    @sameroom-bot
    [Keisuke Fukuda, chainer] Let me join the discussion. I’m from the ChainerMN team. I will take a look at the details.
    Ilia Karmanov
    @ilkarman
    Sure, any help is appreciated
    Keisuke Fukuda
    @keisukefukuda
    @ilkarman
    I would like to make sure of some details.
    (Sorry if I did not follow the conversation fully and missed something already discussed.)
    • What version of NCCL did you use for ChainerMN?
    • What MPI did you use?
    • What communication layer did you use for PyTorch?
    • If you used the Gloo communication library for PyTorch, it should support multiple underlying communication libraries (NCCL, native ibverbs, or TCP). Which did you use?
    • Could you check what happens if you change MultinodeOptimizer to Chainer's normal optimizer, while keeping all the other parts identical? (Maybe this needs more explanation; see the sketch below.)
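    A minimal sketch of that A/B test, assuming the standard ChainerMN setup (the communicator name, optimizer choice, and stand-in model are illustrative):
    # Hedged sketch: compare the multi-node optimizer against a plain one.
    import chainer
    import chainermn

    comm = chainermn.create_communicator('hierarchical')  # as in the logs above
    model = chainer.links.Linear(10, 10)  # stand-in for the real network

    # Distributed variant: wraps a plain optimizer and averages gradients
    # across all ranks after each backward pass.
    optimizer = chainermn.create_multi_node_optimizer(
        chainer.optimizers.MomentumSGD(lr=0.01), comm)

    # For the check suggested above: swap in the plain optimizer and keep
    # everything else identical.
    # optimizer = chainer.optimizers.MomentumSGD(lr=0.01)

    optimizer.setup(model)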
    Sameroom
    @sameroom-bot
    [Albert Kahira, chainer] Running this exactly as it is with mpiexec -np 4 python train.py --gpu --epoch 20 raises the following error:
    [Albert Kahira, chainer] Expect: in_types[0].shape[1] == in_types[1].shape[1] * 1
    Actual: 4 != 16
    Sameroom
    @sameroom-bot
    [Keisuke Fukuda, chainer] I will try it. thanks for reporting!
    Sameroom
    @sameroom-bot
    [Albert Kahira, chainer] Has anyone experienced this error when running the data-parallelism example: "*** Error in `python': munmap_chunk(): invalid pointer: 0x0000558e7f426730 ***"
    Sameroom
    @sameroom-bot
    [Albert Kahira, chainer] Someone had experienced a similar error on PyTorch. It turns out it is a memory-related error; reducing the batch size fixes it. But I am curious why it doesn't show up directly as an "out of memory" error.
    Sameroom
    @sameroom-bot
    [Albert Kahira, chainer] Does anyone have a sample implementation of chainermn.CommunicatorBase.allreduce()?
    Sameroom
    @sameroom-bot
    [Keisuke Fukuda, chainer] Do you want an example of using chainermn.CommunicatorBase.allreduce?
    Do you want to use it directly, rather than multi_node_mean_grad?
    Anyway, I guess it's pretty straightforward and easy to use allreduce(): just pass a NumPy or CuPy array. Do you have a particular question about its usage?
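    A minimal sketch of that usage under an MPI launch; the 'naive' (CPU-only) communicator and array contents are illustrative, and the exact reduction semantics should be confirmed against the ChainerMN docs:
    # Hedged sketch: allreduce on a ChainerMN communicator.
    # Run with e.g.: mpiexec -np 4 python allreduce_example.py
    import numpy as np
    import chainermn

    comm = chainermn.create_communicator('naive')  # CPU communicator, for illustration
    x = np.arange(4, dtype=np.float32) + comm.rank  # each rank contributes its own array
    y = comm.allreduce(x)  # element-wise reduction across all ranks (see docs for semantics)
    print(comm.rank, y)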