    [Kenichi Maehashi, chainer] I noticed that the bias of a Linear link (1-dim) is copied from NumPy to an iDeep array every time the forward computation runs, because to_intel64() does not convert 1-dim arrays.

    [Kenichi Maehashi, chainer] ```py
    import chainer.links as L

    l = L.Linear(100, 100)
    l.to_intel64()
    type(l.W.array)   # <class 'ideep4py.mdarray'>
    type(l.b.array)   # <class 'numpy.ndarray'> -- the 1-dim bias is not converted
    ```

    [fengyuan, chainer] Yes, you are right. The optimized weight array is only created for 'W'.
    [Kenichi Maehashi, chainer] I think b should also be an optimized array, as we are converting it into an optimized array in F.linear to pass it to ideep4py.linear.Forward. https://github.com/chainer/chainer/blob/v5.0.0a1/chainer/functions/connection/linear.py#L60-L69
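
    (For context, a rough sketch of the per-call conversion being described here; intel64.ideep.array is assumed as the NumPy-to-mdarray conversion, and the exact code is at the linked lines.)

    ```py
    import numpy
    from chainer.backends import intel64

    def _as_mdarray(a):
        # After Link.to_intel64(), W is already an ideep4py.mdarray, but the
        # 1-dim bias b is still a plain numpy.ndarray, so converting the inputs
        # before calling the iDeep primitive copies b on every forward pass.
        if type(a) is numpy.ndarray:
            return intel64.ideep.array(a)
        return a
    ```

    F.linear's iDeep branch effectively applies this kind of conversion to x, W, and b before calling intel64.ideep.linear.Forward (the linked lines show the real call).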
    [fengyuan, chainer] There won't be any improvement in the iDeep DNN primitive computation if 'b' is initialized as an mdarray. The iDeep DNN APIs, like linear.Forward, simply accept mdarray parameters, so 'b' is converted into an mdarray.
    [Kenichi Maehashi, chainer] My point was that the overhead of copying b from a NumPy array to an iDeep optimized array occurs every time linear runs, because b is not initialized as an mdarray. That overhead could be eliminated by initializing b as an mdarray. https://chainer.slack.com/archives/C4MQ9RMNG/p1521094728000070
    [fengyuan, chainer] The most significant improvement comes from the optimized weight/data format in the iDeep primitive computation.
    The overhead of copying 'b' may not matter much. (The size of 'b' equals the batch size.)
    [Kenichi Maehashi, chainer] In the Linear layer, the size of b equals the unit size. The overhead becomes relatively large (not very large, but large enough to be measurable in a benchmark) when the batch size is small (e.g., 1).
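
    (A minimal timing sketch of the kind of benchmark being referred to, assuming a machine with ideep4py installed; the layer size and iteration count are arbitrary.)

    ```py
    import time

    import numpy as np
    import chainer
    import chainer.links as L

    l = L.Linear(1000, 1000)
    l.to_intel64()                    # W becomes an ideep4py.mdarray; b stays NumPy
    x = np.random.rand(1, 1000).astype(np.float32)   # batch size 1

    with chainer.using_config('enable_backprop', False):
        start = time.time()
        for _ in range(1000):
            l(x)
        elapsed = time.time() - start
    print('1000 forward passes: %.3f s' % elapsed)
    ```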
    [fengyuan, chainer] Yes, you are right. That would be an optimization to reduce the overhead of the memory copy.
    BTW, our design does not fully correspond to the GPU case, where the overhead of copying between CPU and GPU is considerable. In our design from the very beginning, the main reason the weight is initialized as an mdarray is to get an array in the weight format, which brings a performance improvement to the iDeep primitive computation. At that time we had not taken 'b', or the memory copy, into account.
    [Kenichi Maehashi, chainer] Thank you for the detailed explanation! I understood the policy of the original design.
    So, it is OK to change all_ready to accept (1, 2, 4) instead of (2, 4).
    https://chainer.slack.com/archives/C4MQ9RMNG/p1525777655000192?thread_ts=1522843389.000152&cid=C4MQ9RMNG
    [fengyuan, chainer] In our design,
    1. all_ready is a programmer helper function. Its purpose is to check whether the inputs can be accepted by the current iDeep primitive, not whether the dimensions are valid for an iDeep mdarray. So there is no need to check all_ready when W or b is initialized. There is a mistake about this in Intel-Chainer release_v3. https://github.com/intel/chainer/blob/release_v3/chainer/links/connection/linear.py#L128
    2. The programmer must be aware of the valid/supported input dimensions of the current iDeep primitive. By default, it is (2, 4). The programmer can adjust supported_ndim of all_ready accordingly.
      My suggestion here is to initialize b as an mdarray directly, without all_ready. What do you think?

    [fengyuan, chainer] e.g., explicitly define supported_ndim if an iDeep primitive only supports 1-D inputs:

    if intel64.inputs_all_ready((x, y, z), supported_ndim=(1,)):
        intel64.ideep.some_primitive(x, y, z)

    [Kenichi Maehashi, chainer] Thank you, now I understand where I went wrong. Chainer uses all_ready in Variable.to_intel64, whereas intel-chainer does not use it in Variable.to_ia.
    https://github.com/chainer/chainer/blob/v5.0.0a1/chainer/variable.py#L770-L786
    https://github.com/intel/chainer/blob/release_v3/chainer/variable.py#L728
    [Kenichi Maehashi, chainer] We used all_ready in Variable.to_intel64 to support calling to_intel64 on any user-defined Link (which may have a parameter whose shape is not supported by ideep). We had to call inputs_all_ready with (1, 2, 4) here.
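
    (Roughly what that check looks like; this paraphrases the linked Variable.to_intel64 lines, with intel64.ideep.array assumed as the conversion call.)

    ```py
    from chainer.backends import intel64

    def _to_intel64_data(data):
        # Accept 1-dim parameters (e.g. the Linear bias) in addition to the
        # default (2, 4), so to_intel64() works on any user-defined Link whose
        # parameter shapes may not all be supported by iDeep primitives.
        if intel64.inputs_all_ready((data,), supported_ndim=(1, 2, 4)):
            data = intel64.ideep.array(data)
        return data
    ```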
    [fengyuan, chainer] Got it. iDeep should support arrays of any number of dimensions; that will be implemented in a later iDeep (v2). For now, could you implement a light helper function that checks the supported dims at iDeep array creation, rather than all_ready, which is heavy?
    [Kenichi Maehashi, chainer] Right, data.ndim in (1, 2, 4) sounds sufficient.
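
    (A sketch of the lighter check being agreed on here; the helper name is made up for illustration.)

    ```py
    import numpy

    # Dimensions that can currently be held in an ideep4py.mdarray.
    _SUPPORTED_NDIM = (1, 2, 4)

    def can_convert_to_ideep(data):
        # Lightweight alternative to the heavier all_ready check: only verify
        # that the array's ndim is supported before creating an iDeep array.
        return isinstance(data, numpy.ndarray) and data.ndim in _SUPPORTED_NDIM
    ```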
    [ikei, chainer] :+1:
    [Kenichi Maehashi, chainer] Do you have a plan when iDeep 2.0 will be released?
    [Kenichi Maehashi, chainer] I also hope this fix is included in iDeep4py 2.0. https://chainer.slack.com/archives/C4MQ9RMNG/p1517983472000240
    Currently users are forced to install NumPy==1.13.0 and cannot use other versions of NumPy.
    Setting the requirement to numpy>=1.13.0 would be nice.
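
    (In packaging terms the request is just relaxing the pin in the package metadata; a minimal setuptools sketch, not the actual ideep4py setup script.)

    ```py
    # setup.py (sketch)
    from setuptools import setup

    setup(
        name='ideep4py',
        # ... other metadata ...
        install_requires=['numpy>=1.13.0'],   # instead of pinning numpy==1.13.0
    )
    ```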
    [Cao Zhong, chainer] It is nearly done and in the master branch. We are deciding when to release it and what goes into it. @kmaehashi
    [Cao Zhong, chainer] Yes, we’ll remove that constraint.
    [Kenichi Maehashi, chainer] Nice, thank you!
    [Kenichi Maehashi, chainer] I'd like to copy data from a NumPy array to an iDeep array, and found ideep4py.basic_copyto(dst, src), which seems to be an equivalent of numpy.copyto(dst, src). Is basic_copyto considered stable? Can we use this interface? (context: chainer/chainer#5009)
    [Cao Zhong, chainer] @feng1.yuan
    [fengyuan, chainer] Yes, that is a stable API of the iDeep Python package (ideep4py).
    The usage would be:
    ideep4py.basic_copyto(x_ideep, x_np)
    [Kenichi Maehashi, chainer] Ok, thanks for clarification!
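
    (A small usage sketch; ideep4py.array is assumed here as the way to create the destination mdarray.)

    ```py
    import numpy as np
    import ideep4py

    x_np = np.arange(12, dtype=np.float32).reshape(3, 4)
    x_ideep = ideep4py.array(np.zeros_like(x_np))   # destination iDeep array

    # In-place copy from the NumPy array into the existing iDeep array,
    # analogous to numpy.copyto(dst, src).
    ideep4py.basic_copyto(x_ideep, x_np)
    ```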
    [mingxiao huang, chainer] @mitmul Thanks. So, any findings on your side?
    [Shunta Saito, chainer] We haven't managed to find the time to work on it yet. Please give us some time, until this weekend. Sorry about that.
    [mingxiao huang, chainer] @mitmul OK. ChainerMN's batch normalization implementation is heavily based on an old version of Chainer's batch normalization; I am trying to sync it with the latest version of Chainer's batch normalization. I will keep you informed of the progress. Thanks.
    [mingxiao huang, chainer] @mitmul We found something improper in our script: we did not shuffle the input training set during scatter. After shuffling the data, we no longer see any validation gap compared to single node after 10000 iterations. Thanks.
    [Shunta Saito, chainer] Great! Thanks for letting us know.
    [mingxiao huang, chainer] @nishino Hello, nishino. Recently we trained a ResNet50 GPU model on a single node and only got 73+% validation accuracy on ILSVRC2012, a 2% gap compared to SOTA. We used the poly learning rate policy. It is said that you got SOTA for ResNet50 on your site; could you please share the hyperparameters with us? Thanks.
    [Shunta Saito, chainer] Which site are you referring to? I'll ask the people who conducted that experiment once I know the exact page URL.
    [mingxiao huang, chainer] @mitmul From http://on-demand.gputechconf.com/gtc/2018/presentation/s8889-training-imagenet-in-15-minutes-with-chainermn-a-scalable-distributed-dl-framework.pdf we can see that you got a comparable accuracy of 74.9% on multi-node. Could you please send us the training script you used? If you have ever trained ResNet50 on a single node, it would be even better to also send us the training script you used for the single node. Thanks.
    [Shunta Saito, chainer] @mingxiao.huang Is the batchsize you are using for your experiment also 32000?
    [Shunta Saito, chainer] @kfukuda Hi Fukuda-san, he is referring to the slides you presented at GTC and asking about the training settings used to achieve 74.9% accuracy on ImageNet with ResNet50. Can we share the training script with him?
    [Keisuke Fukuda, chainer] I think we need to consult Akiba-san.
    [Keisuke Fukuda, chainer] However, basically, everything is written in the paper
    [Shunta Saito, chainer] Yeah, @mingxiao.huang Have you already read this paper? https://arxiv.org/abs/1711.04325
    [mingxiao huang, chainer] @mitmul Yes, we have read it. However, to make the problem simpler, we are trying the single-GPU case now; it would be nice if you have such a training script for a single node and could send it to us. Thanks.
    [Shunta Saito, chainer] What is the global minibatch size? I think the techniques described in the paper are for an "extremely large minibatch", as you can see in the title. If the global minibatch size is not so large because you are using just a single GPU, the techniques required to achieve good results will be different. What do you think @kfukuda?
    [mingxiao huang, chainer] @mitmul Yes, that is also my concern. For single node, we used batchsize=128. The techniques required to achieve good results will be different for single-node training, so it would be nice if you have such a training script for a single node and could send it to us. Thanks.
    [mingxiao huang, chainer] @mitmul Any more comments? Do you have such a training script for a single node?

    [Shunta Saito, chainer] @mingxiao.huang Could you answer his question?

    Did you achieve the SOTA with the original ResNet-50 (with batchsize 64)?

    [Shunta Saito, chainer] Regarding a training script, I think it's OK to use the normal one for ResNet50 training on ImageNet-1K with a minibatch size of 128. You can find the example code in the Chainer repository. Have you already tried that?
    [mingxiao huang, chainer] @mitmul The example code in the Chainer repository is rather simple. Actually, we would like to reproduce your result with the detailed hyperparameters you used, since we could not get SOTA accuracy for ResNet50 on Chainer at all, whereas with the same parameters we can achieve SOTA on Caffe.
    [Shunta Saito, chainer] @mingxiao.huang What we intended to say was that you need to use an "extremely large batchsize" to reproduce the result in the paper. If you want to achieve over 74% accuracy using only 1 GPU, you need a different technique for that setting. I think we have never tested such a small environment with only 1 GPU because it takes too much time to finish 90 epochs. Is this correct? @kfukuda
    [Keisuke Fukuda, chainer] Yes, even the original ResNet author (Kaiming He) used 8 GPUs (if I remember correctly). The techniques written in the paper are for extremely large minibatch sizes (>8K), and it does not make much sense to use them (RMSProp warmup etc.) for such a small batchsize.
    [Keisuke Fukuda, chainer] Before trying our techniques, I recommend trying to reproduce Facebook's result with a 4K batchsize.
    [Keisuke Fukuda, chainer] Techniques for a large batch size are not necessarily good for a small batchsize.