    sergei-from-vironit
    @sergei-from-vironit
    Thanks!
    emesha92
    @emesha92

    Hi. So you exported your model using OpenNMT-tf 2.x? If yes, you should use the custom serving image opennmt/tensorflow-serving:2.0.0-gpu, which includes an additional op. See here for more info: https://github.com/OpenNMT/OpenNMT-tf/tree/master/examples/serving/tensorflow_serving#custom-tensorflow-serving-image
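    For reference, a minimal sketch of how that custom image could be launched, modeled on the linked serving example; the model directory, model name, and port below are placeholders, not values from this conversation:

```
# Sketch only: "ende", /models/ende, and port 9000 are placeholder values.
# Requires Docker with NVIDIA GPU support (nvidia-container-toolkit).
docker run -t --rm --gpus all -p 9000:9000 \
  -v $PWD/ende:/models/ende \
  --entrypoint tensorflow_model_server \
  opennmt/tensorflow-serving:2.0.0-gpu \
  --port=9000 --model_name=ende --model_base_path=/models/ende
```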

    Is OpenNMT's TensorFlow Serving image an optimized build or not?

    Guillaume Klein
    @guillaumekln
    It uses the same build flags as the tensorflow/serving images.
    emesha92
    @emesha92
    ok noted, thanks @guillaumekln
    sergei-from-vironit
    @sergei-from-vironit
    Hello all. Does anybody know why the training task is stuck?
    image.png
    ...it is stuck on this line: "If using Keras pass *_constraint arguments to layers."
    Guillaume Klein
    @guillaumekln
    Hi! The training is running on CPU. The latest OpenNMT-tf version uses TensorFlow 2.1, which requires CUDA 10.1.
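    A quick, generic way to confirm whether TensorFlow actually sees the GPUs (not specific to this machine):

```
# Check the driver and CUDA version reported by the NVIDIA tools
nvidia-smi
# Ask TensorFlow 2.x what it can see; an empty list means it will fall back to CPU
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```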
    sergei-from-vironit
    @sergei-from-vironit
    And I tried to start a training task on 2 video cards (RTX 2080), but when OpenNMT-tf starts the "init data" phase, my PC reboots.
    image.png
    hm
    Maybe you are right.
    But I start the training task with this command: CUDA_VISIBLE_DEVICES=0 onmt-main --model_type TransformerBig --config config.yml --auto_config train --num_gpus 1
    OK, I see. It doesn't load the CUDA library. Thanks ^)
    sergei-from-vironit
    @sergei-from-vironit
    image.png
    So strange. My PC reboots during the checkpoint saving phase when training on 2 video cards. Any ideas?
    Guillaume Klein
    @guillaumekln
    I remember this could happen when the power supply is not powerful enough.
    sergei-from-vironit
    @sergei-from-vironit
    Can I limit the GPU load?
    sergei-from-vironit
    @sergei-from-vironit
    I ran 2 tasks on the cards (1 task per card), and there was no restart.
    image.png
    Maybe there is some problem with running the tasks in parallel?
    Guillaume Klein
    @guillaumekln
    Apparently you can try limiting the maximum power used by the cards with nvidia-smi -pl (I have never used it, so I can't advise further).
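    For example, something along these lines; the 200 W figure is purely illustrative, so check the supported range of your cards first:

```
# Show current, default, and maximum power limits for each GPU
nvidia-smi -q -d POWER
# Lower the power limit of GPU 0 to 200 W (illustrative value, requires root)
sudo nvidia-smi -i 0 -pl 200
```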
    sergei-from-vironit
    @sergei-from-vironit
    OK, thanks, I'll try it.
    sergei-from-vironit
    @sergei-from-vironit
    Hello. I am trying to run two tests: one with a 1M dataset and one with a 2M dataset. Do I have to change the configuration? (Maybe more steps if I use the larger dataset?)
    Guillaume Klein
    @guillaumekln
    Hi, if you want to compare the 2 results you probably want to keep the same configuration and the same number of training steps.
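    In other words, only the training data should change between the two runs. A sketch, reusing the command quoted earlier (the config file names are hypothetical):

```
# Same model, same auto_config defaults, same number of steps; only the data differs.
CUDA_VISIBLE_DEVICES=0 onmt-main --model_type TransformerBig --config config_1M.yml --auto_config train --num_gpus 1
CUDA_VISIBLE_DEVICES=0 onmt-main --model_type TransformerBig --config config_2M.yml --auto_config train --num_gpus 1
```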
    sergei-from-vironit
    @sergei-from-vironit
    I did this test, and the results are very different.
    Maybe one dataset is worse than the other.
    But maybe we need to change some parameters )
    Guillaume Klein
    @guillaumekln
    Better result with more data? This looks expected.
    sergei-from-vironit
    @sergei-from-vironit
    No, a worse result with the larger dataset.
    Anna Samiotou
    @annasamt
    image.png
    Hello, I have trained OpenNMT-tf models with version 2.1.1 on a Linux/Ubuntu 18.04 GPU setup. I have now installed the latest version, 2.6.0, on a new VM which is CPU-only. In principle, is it possible to run inference on the latter CPU-only machine with the above-mentioned GPU-trained models? By the way, when running inference I get a couple of error messages regarding missing libraries (please see the screenshot), but I am not sure whether I should ignore them if I only use the CPU.
    Guillaume Klein
    @guillaumekln
    From the logs, it seems you are loading a checkpoint trained with version 1.x and not 2.1.1. Is that expected?
    (Yes, you should ignore warnings about missing NVIDIA libraries when running on CPU)
    Anna Samiotou
    @annasamt
    No, the models were trained with v2.x for sure.
    Guillaume Klein
    @guillaumekln
    What is the content of the model_dir directory that you defined in the configuration?
    Anna Samiotou
    @annasamt
    image.png
    Is this what you mean, or the contents of data.yml?
    Guillaume Klein
    @guillaumekln
    Yeah that is what I wanted to see. Do you also have a file named checkpoint? If so, what is its content?
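    For context, checkpoint is a small text file that records which checkpoint TensorFlow loads by default. A sketch of how to inspect it (the path and step number are illustrative):

```
cat model_dir/checkpoint
# Typical content written by tf.train.CheckpointManager (step number illustrative):
#   model_checkpoint_path: "ckpt-27000"
#   all_model_checkpoint_paths: "ckpt-27000"
```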
    Anna Samiotou
    @annasamt
    image.png
    Yes, I do. But this one refers to the averaged checkpoint of 27K steps (i.e. the 20K checkpoint file was overwritten by this one). But I decided to use the 20K checkpoint for the inference - is this OK?
    Guillaume Klein
    @guillaumekln
    That could be an issue as we try to load the last checkpoint by default. However, this does not match the error log. Could you try using the --checkpoint_path command line option and point to the 20k checkpoint?
    Anna Samiotou
    @annasamt
    I think that is exactly what I did. Please look at the command in "script-NMT-OpenNMT-tf.sh": onmt-main --config $1/data.yml --auto_config --model_type Transformer --checkpoint_path $1/model.ckpt-20000 infer --features_file $2/s.txt --predictions_file $2/t.txt
    Anna Samiotou
    @annasamt
    Is OpenNMT-tf v2.6.0 backward compatible with v2.2.1? Perhaps I should install the latter on the VM instead? Perhaps the failure regarding the TensorSliceReader constructor will then disappear?
    Anna Samiotou
    @annasamt
    Well, in https://github.com/OpenNMT/OpenNMT-tf I see that "The project is production-oriented and comes with backward compatibility guarantees." so please ignore my previous message
    Guillaume Klein
    @guillaumekln
    You need to set ckpt-20000, not model.ckpt-20000 on the command line.
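    With that change, the inference command from the script above becomes (all other arguments kept exactly as in the original script):

```
onmt-main --config $1/data.yml --auto_config --model_type Transformer \
  --checkpoint_path $1/ckpt-20000 \
  infer --features_file $2/s.txt --predictions_file $2/t.txt
```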
    Anna Samiotou
    @annasamt
    It seems to work now! Thanks for the tip! So did this change with 2.x or with the latest 2.6.0?
    Guillaume Klein
    @guillaumekln
    The model prefix changed when moving from 1.x to 2.x.
    Anna Samiotou
    @annasamt
    OK, it was a leftover from the script I used with 1.x and I did not notice. Thanks a lot!
    Anna Samiotou
    @annasamt
    Hello again. To add more training data sets to an existing model (checkpoint) and train it further, not necessarily for fine-tuning but rather to add more parallel examples for the supervised learning, is the best practice to add the new training data sets to the config file (e.g. data.yml), to do update-vocab (merge), or both?