    sergei-from-vironit
    @sergei-from-vironit
    Hello. I trained a model with OpenNMT-tf 1.24.0 and am trying to serve it with tensorflow/serving:2.1.0-rc1-gpu, but I get the error below. What could it be?
    2020-01-08 09:57:52.021820: E external/org_tensorflow/tensorflow/core/grappler/optimizers/meta_optimizer.cc:561] function_optimizer failed: Not found: Op type not registered 'Addons>GatherTree' in binary running on model-serve-dev-fast-658d8d96b9-vqmdt. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
    2020-01-08 09:57:52.132840: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at partitioned_function_ops.cc:113 : Not found: Op type not registered 'Addons>GatherTree' in binary running on model-serve-dev-fast-658d8d96b9-vqmdt. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed
    Guillaume Klein
    @guillaumekln
    Hi. So you exported your model using OpenNMT-tf 2.x? If yes, you should use the custom serving image opennmt/tensorflow-serving:2.0.0-gpu, which includes the additional ops. See here for more info: https://github.com/OpenNMT/OpenNMT-tf/tree/master/examples/serving/tensorflow_serving#custom-tensorflow-serving-image
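    For reference, running that custom image follows the same invocation as the stock tensorflow/serving Docker image; a minimal sketch, where the export path and the model name "ende" are assumptions:

    ```shell
    # Serve an OpenNMT-tf export with the custom image, which bundles the
    # TensorFlow Addons ops (e.g. Addons>GatherTree) missing from stock Serving.
    # /path/to/export and "ende" are placeholders; --gpus all requires the
    # NVIDIA container runtime.
    docker run --rm --gpus all -p 8501:8501 \
      -v /path/to/export:/models/ende \
      -e MODEL_NAME=ende \
      opennmt/tensorflow-serving:2.0.0-gpu
    ```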
    sergei-from-vironit
    @sergei-from-vironit
    Hm,
    but I did not set beam_width.
    Why should I use a custom serving image?
    Guillaume Klein
    @guillaumekln
    The StackOverflow answer provides 2 solutions. Which one did you choose?
    sergei-from-vironit
    @sergei-from-vironit
    I chose option 2.
    Now I tried your tensorflow-serving image and it works.
    Thanks!
    emesha92
    @emesha92

    Hi. So you exported your model using OpenNMT-tf 2.x? If yes, you should use the custom serving image opennmt/tensorflow-serving:2.0.0-gpu, which includes the additional ops. See here for more info: https://github.com/OpenNMT/OpenNMT-tf/tree/master/examples/serving/tensorflow_serving#custom-tensorflow-serving-image

    Is OpenNMT's TensorFlow Serving image built with the optimized build flags or not?

    Guillaume Klein
    @guillaumekln
    It uses the same build flags as the tensorflow/serving images.
    emesha92
    @emesha92
    ok noted, thanks @guillaumekln
    sergei-from-vironit
    @sergei-from-vironit
    Hello all. Does anybody know why the training task is stuck?
    [screenshot]
    ...it is stuck on this line: "If using Keras pass *_constraint arguments to layers."
    Guillaume Klein
    @guillaumekln
    Hi! The training is running on CPU. The latest OpenNMT-tf version uses TensorFlow 2.1 which requires CUDA 10.1.
    sergei-from-vironit
    @sergei-from-vironit
    I also tried starting a training task on 2 video cards (RTX 2080), but when OpenNMT-tf starts the "init data" phase, my PC reboots.
    [screenshot]
    Hm,
    maybe you are right.
    But I started the training task with this command: "CUDA_VISIBLE_DEVICES=0 onmt-main --model_type TransformerBig --config config.yml --auto_config train --num_gpus 1"
    Ok, I see. It doesn't load the CUDA library. Thanks ^)
    sergei-from-vironit
    @sergei-from-vironit
    [screenshot]
    So strange. My PC reboots during the checkpoint-saving phase on 2 video cards. Any ideas?
    Guillaume Klein
    @guillaumekln
    I remember this could happen when the power supply is not powerful enough.
    sergei-from-vironit
    @sergei-from-vironit
    Can I limit the GPU load?
    sergei-from-vironit
    @sergei-from-vironit
    I ran 2 tasks on the cards (1 task per card), and there is no restart.
    [screenshot]
    Maybe there is some problem with parallel tasks?
    Guillaume Klein
    @guillaumekln
    Apparently you can try limiting the maximum power used by the cards with nvidia-smi -pl (I never used that so can't advise more).
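    A hedged sketch of the nvidia-smi power-limit commands mentioned above (the 200 W value and the GPU index are assumptions; the limit needs root and must lie within the range reported by the query):

    ```shell
    # Show current, default, and min/max enforceable power limits for GPU 0.
    nvidia-smi -i 0 -q -d POWER
    # Cap GPU 0 at 200 W (example value); resets at reboot unless made persistent.
    sudo nvidia-smi -i 0 -pl 200
    ```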
    sergei-from-vironit
    @sergei-from-vironit
    Ok. Thanks, I will try it.
    sergei-from-vironit
    @sergei-from-vironit
    Hello. I am trying two tests: one with a 1M-pair dataset and one with a 2M-pair dataset. Do I have to change the configuration? (Maybe more steps if I use the larger dataset?)
    Guillaume Klein
    @guillaumekln
    Hi, if you want to compare the 2 results you probably want to keep the same configuration and the same number of training steps.
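    Concretely, that means pinning the step budget and batch settings in the YAML config for both runs; a sketch with assumed values (option names follow the OpenNMT-tf 2.x configuration):

    ```yaml
    # Keep these identical for the 1M and 2M runs so only the data changes.
    train:
      max_step: 500000            # same number of training steps in both runs
      batch_size: 3072            # same effective batch size
      save_checkpoints_steps: 5000
    ```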
    sergei-from-vironit
    @sergei-from-vironit
    I did this test, and the results are very different.
    Maybe one dataset is worse than the other.
    But maybe we need to adjust some parameters )
    Guillaume Klein
    @guillaumekln
    Better result with more data? This looks expected.
    sergei-from-vironit
    @sergei-from-vironit
    No, a worse result with the larger dataset.
    Anna Samiotou
    @annasamt
    [screenshot]
    Hello, I have trained OpenNMT-tf models with version 2.1.1 on a Linux/Ubuntu 18.04 GPU setup. I have now installed the latest version 2.6.0 on a new VM which is CPU-only. In principle, is it possible to run inference on the above-mentioned GPU-trained models from this CPU-only machine? Btw, when running inference I do get a couple of error messages regarding missing libraries (please see screenshot), but I am not sure whether I should ignore them if I only use the CPU.
    Guillaume Klein
    @guillaumekln
    From the logs, it seems you are loading a checkpoint trained with version 1.x and not 2.1.1. Is that expected?
    (Yes, you should ignore warnings about missing NVIDIA libraries when running on CPU)
    Anna Samiotou
    @annasamt
    No, the models were trained with v2.x for sure.
    Guillaume Klein
    @guillaumekln
    What is the content of the model_dir directory that you defined in the configuration?
    Anna Samiotou
    @annasamt
    image.png
    Is this what you mean, or the contents of data.yml?
    Guillaume Klein
    @guillaumekln
    Yeah that is what I wanted to see. Do you also have a file named checkpoint? If so, what is its content?
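    For context, `checkpoint` is a small text-format protobuf naming the checkpoint TensorFlow restores by default. A minimal, dependency-free sketch of reading it (the file contents below are made up for illustration):

    ```python
    def parse_checkpoint_file(text):
        """Return (default_checkpoint, all_checkpoints) from a `checkpoint` file."""
        latest, all_paths = None, []
        for line in text.splitlines():
            key, _, value = line.partition(":")
            value = value.strip().strip('"')
            if key.strip() == "model_checkpoint_path":
                latest = value
            elif key.strip() == "all_model_checkpoint_paths":
                all_paths.append(value)
        return latest, all_paths

    # Example contents (illustrative only):
    example = '''model_checkpoint_path: "ckpt-27000"
    all_model_checkpoint_paths: "ckpt-20000"
    all_model_checkpoint_paths: "ckpt-27000"
    '''
    latest, history = parse_checkpoint_file(example)
    print(latest, history)
    ```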
    Anna Samiotou
    @annasamt
    [screenshot]
    Yes, I do. But this one refers to the averaged checkpoint at 27K steps (i.e., the 20K entry in the checkpoint file was overwritten by this one). But I decided to use the 20K checkpoint for inference. Is this OK?
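    If the 20K checkpoint files (ckpt-20000.index / .data-*) are still present, one way to bypass the `checkpoint` pointer is to pass the path explicitly; a sketch, where file names and flag placement are assumptions based on the OpenNMT-tf 2.x CLI:

    ```shell
    onmt-main --config data.yml --auto_config \
      --checkpoint_path model_dir/ckpt-20000 \
      infer --features_file src-test.txt --predictions_file predictions.txt
    ```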