    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/decoders/decoder.py:396 dynamic_decode
    return decoding.dynamic_decode(
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/utils/decoding.py:491 dynamic_decode
    ids, attention, lengths = decoding_strategy._finalize( # pylint: disable=protected-access
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/utils/decoding.py:344 _finalize
    ids = tfa.seq2seq.gather_tree(step_ids, parent_ids, maximum_lengths, end_id)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_addons/seq2seq/beam_search_decoder.py:35 gather_tree
    return _beam_search_so.ops.addons_gather_tree(*args, **kwargs)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_addons/utils/resource_loader.py:49 ops
    self._ops = tf.load_op_library(get_path_to_datafile(self.relative_path))
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/framework/load_library.py:57 load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
    NotFoundError: /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_addons/custom_ops/seq2seq/_beam_search_ops.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs
    Here is the other part @guillaumekln
    Guillaume Klein
    What TensorFlow version do you have installed?
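[Editor's note: the `undefined symbol` ending in `Ss` usually indicates a C++ ABI mismatch between the tensorflow-addons binary and the installed TensorFlow, which is why the version question matters. In the Itanium name-mangling scheme, `Ss` abbreviates the pre-C++11 `std::string`, while binaries built with the C++11 libstdc++ ABI mangle it through the `std::__cxx11` namespace. A minimal, purely illustrative sketch of that diagnosis:]

```python
def uses_old_string_abi(mangled_symbol):
    """Heuristic sketch: in the Itanium C++ mangling scheme, the
    substitution "Ss" abbreviates the pre-C++11 std::string; symbols
    built against the C++11 libstdc++ ABI carry a "B5cxx11" tag
    instead of ending in "Ss"."""
    return mangled_symbol.endswith("Ss")

# The symbol from the traceback above ends in "Ss", suggesting the
# tensorflow-addons wheel was built against a TensorFlow binary with
# a different C++ ABI than the one installed.
symbol = "_ZN10tensorflow12OpDefBuilder4AttrESs"
print(uses_old_string_abi(symbol))  # True
```

In practice the fix is to install a tensorflow-addons release built for the exact TensorFlow version in the environment.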
    I am looking for a working example of OpenNMT-tf and Keras. Any suggestion for a GitHub repo?
    Can anyone suggest an example of OpenNMT-tf with Keras?
    Also, what is the difference between OpenNMT-tf and https://github.com/lvapeab/nmt-keras?
    Guillaume Klein
    Do you mean an example using the Keras functional API? If yes, I'm not aware of such an example.
    I did not know about nmt-keras, but I would say the difference is that nmt-keras prioritizes Keras APIs while OpenNMT-tf prioritizes TensorFlow APIs.
    Wondering about the status of Horovod support in OpenNMT-tf (tag 1.25.3). Does it work out of the box in a multi-node environment? I've been trying to test distributed train_and_eval across multiple nodes with 8 V100 GPUs per node, but without luck.
    Guillaume Klein
    It should work with OpenNMT-tf 1.x. Did you read this documentation? https://github.com/OpenNMT/OpenNMT-tf/blob/v1.25.3/docs/training.md#via-horovod-experimental

    I'm trying to run with 16 GPUs (2 nodes * 8 GPUs) using the following, but without luck:

    nmarcel8@hpclogin1:/s/mlsc/nmarcel8/nmt/Portuguese/2020-03-23> cat exp.sh
    export PYTHONPATH=/s/mlsc/nmarcel8/git/mtda

    mpirun -np 16 \
    --hostfile ${PBS_NODEFILE} \
    -bind-to none -map-by slot \
    -mca btl_tcp_if_exclude lo,docker0 \
    -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python \
    $base_dir/opennmt/bin/main.py \
    train_and_eval \
    --model_type TransformerBigFP16 \
    --config exp.yml \
    --auto_config

    Guillaume Klein
    Is there an error? Or what is the issue?
    Right now my job is in the queue waiting to launch. Quick question: do we need to specify --num_gpus as a parameter to bin/main.py?
    I'm thinking that if we specify the Horovod flag it's not necessary.
    Guillaume Klein
    If I remember correctly you don't need to specify this option as Horovod will launch 16 separate processes.
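[Editor's note: the reason no --num_gpus flag is needed is that each of the 16 MPI processes trains on a single GPU selected by its local rank on its node. An illustrative sketch of that mapping, assuming ranks are assigned node by node (the function name is hypothetical, not a Horovod API):]

```python
def gpu_for_rank(global_rank, gpus_per_node=8):
    """Illustrative only: with `mpirun -np 16` over 2 nodes of 8 GPUs,
    each process pins the GPU matching its local rank on its node,
    so a per-process GPU count never has to be passed to the trainer."""
    return global_rank % gpus_per_node

# Ranks 0-7 use GPUs 0-7 on the first node; ranks 8-15 use GPUs 0-7
# on the second node.
print([gpu_for_rank(r) for r in range(16)])
```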
    Getting errors like this:
    I0326 15:42:23.700688 140253767198528 estimator.py:1147] Done calling model_fn.
    Traceback (most recent call last):
    File "/s/mlsc/nmarcel8/git/mtda/opennmt/bin/main.py", line 204, in <module>
    File "/s/mlsc/nmarcel8/git/mtda/opennmt/bin/main.py", line 175, in main
    File "/s/mlsc/nmarcel8/git/mtda/opennmt/runner.py", line 301, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
    return self.run_local()
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1426, in _train_with_estimator_spec
    'There should be a CheckpointSaverHook to use saving_listeners. '
    ValueError: There should be a CheckpointSaverHook to use saving_listeners. Please set one of the RunConfig.save_checkpoints_steps or RunConfig.save_checkpoints_secs.
    When I run on a single node (using just 8 GPUs) everything works fine, but not on multiple nodes.
    Guillaume Klein
    Can you try using train instead of train_and_eval?
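[Editor's note: the ValueError above asks for RunConfig.save_checkpoints_steps or save_checkpoints_secs to be set. If I recall correctly, the OpenNMT-tf v1 YAML configuration exposes a checkpoint interval under the train section, along these lines (key placement should be checked against the v1 configuration docs):]

```yaml
train:
  # Sketch of a possible fix: giving the estimator a checkpoint interval
  # creates a CheckpointSaverHook, which saving_listeners can attach to.
  save_checkpoints_steps: 5000
```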
    I submitted the job. Let me check on the status:

    nmarcel8@hpclogin1:/s/mlsc/nmarcel8/nmt/Portuguese/2020-03-23> fqstat

    Job ID Job Name App CPUs S Elap Time User

    6963772.hpcq exp HOROVOD 16 R 00:00:11 nmarcel8

    With train I get messages like the following:
    hpca02r01n04:180:799 [4] NCCL INFO comm 0x7f4c3b1e2fb0 rank 4 nranks 16 cudaDev 4 nvmlDev 4 - Init COMPLETE
    hpca02r01n04:178:801 [2] NCCL INFO Trees [0] 6->2->0/-1/-1 [1] 1->2->3/-1/-1 [2] -1->2->1/10/-1 [3] 3->2->6/-1/-1 [4] 6->2->0/-1/-1 [5] 1->2->3/-1/-1 [6] 10->2->1/-1/-1 [7] 3->2->6/-1/-1
    hpca02r01n04:176:800 [0] NCCL INFO Ring 07 : 8 -> 0 [receive] via NET/IB/3/GDRDMA
    hpca02r01n04:182:802 [6] NCCL INFO comm 0x7f5e6f1e1cd0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 - Init COMPLETE
    hpca02r01n04:179:805 [3] NCCL INFO comm 0x7efb47136920 rank 3 nranks 16 cudaDev 3 nvmlDev 3 - Init COMPLETE
    hpca02r01n04:178:801 [2] NCCL INFO comm 0x7fb877137090 rank 2 nranks 16 cudaDev 2 nvmlDev 2 - Init COMPLETE
    hpca02r02n05:38:629 [0] NCCL INFO Ring 07 : 8 -> 0 [send] via NET/IB/3/GDRDMA
    hpca02r02n05:38:629 [0] NCCL INFO Trees [0] 10->8->11/-1/-1 [1] 12->8->9/-1/-1 [2] 9->8->12/-1/-1 [3] 0->8->11/-1/-1 [4] 10->8->11/-1/-1 [5] 12->8->9/-1/-1 [6] 9->8->12/-1/-1 [7] -1->8->11/0/-1
    hpca02r02n05:38:629 [0] NCCL INFO comm 0x7f6197879f50 rank 8 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
    hpca02r01n04:176:800 [0] NCCL INFO Trees [0] 2->0->3/-1/-1 [1] 4->0->1/-1/-1 [2] 1->0->4/-1/-1 [3] -1->0->3/8/-1 [4] 2->0->3/-1/-1 [5] 4->0->1/-1/-1 [6] 1->0->4/-1/-1 [7] 8->0->3/-1/-1
    hpca02r01n04:176:800 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled for all sizes
    hpca02r01n04:176:800 [0] NCCL INFO comm 0x7effb60eb430 rank 0 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
    hpca02r01n04:176:800 [0] NCCL INFO Launch mode Parallel
    I0326 16:02:05.270775 139641160255296 basic_session_run_hooks.py:262] loss = 9.732766, step = 0
    I0326 16:02:05.633767 140434656532288 basic_session_run_hooks.py:262] loss = 9.731406, step = 0
    I0326 16:02:05.639509 140400681133888 basic_session_run_hooks.py:262] loss = 9.727156, step = 0
    I0326 16:02:05.661324 140559522916160 basic_session_run_hooks.py:262] loss = 9.736551, step = 0
    I0326 16:02:05.671058 139726011828032 basic_session_run_hooks.py:262] loss = 9.732926, step = 0
    I0326 16:02:05.674440 140535427962688 basic_session_run_hooks.py:262] loss = 9.729025, step = 0
    I0326 16:02:05.675842 140709400389440 basic_session_run_hooks.py:262] loss = 9.741848, step = 0
    I0326 16:02:05.676417 139622124476224 basic_session_run_hooks.py:262] loss = 9.73002, step = 0
    I0326 16:02:05.681042 139969783359296 basic_session_run_hooks.py:262] loss = 9.730884, step = 0
    I0326 16:02:05.686120 139980528740160 basic_session_run_hooks.py:262] loss = 9.743285, step = 0
    I0326 16:02:05.689950 140085856864064 basic_session_run_hooks.py:262] loss = 9.726588, step = 0
    I0326 16:02:05.690726 140271360452416 basic_session_run_hooks.py:262] loss = 9.741635, step = 0
    I0326 16:02:05.692858 140061501163328 basic_session_run_hooks.py:262] loss = 9.737735, step = 0
    I0326 16:02:05.693819 140047958247232 basic_session_run_hooks.py:262] loss = 9.734624, step = 0
    I0326 16:02:05.695163 139628563687232 basic_session_run_hooks.py:262] loss = 9.73243, step = 0
    I0326 16:02:05.699985 139695624750912 basic_session_run_hooks.py:262] loss = 9.742135, step = 0
    W0326 16:02:17.888350 140400681133888 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
    W0326 16:02:17.890248 139641160255296 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
    W0326 16:02:17.891450 140434656532288 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
    W0326 16:02:17.899081 140559522916160 basic
    It is getting stuck and just repeating the message "It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0."
    Guillaume Klein
    I don't recall this type of log in the context of Horovod. Note that I'm working on integrating Horovod in V2: OpenNMT/OpenNMT-tf#639. Maybe you could try that one if it is possible for you to update.
    Jordi Mas
    Every time I train a model with the same data set I get a different model, since the output of neural networks is not deterministic. Even if I compare 2 models with very similar BLEU scores, they translate in very different ways. Every new model improves on some sentences but does worse on others. This makes it really difficult to:
    a) Build quality assessment on top of previous results, since the models translate differently every time
    b) Avoid confusing users, since a new model translates differently from the previous one
    I guess that I can play with the validation corpus and re-training.
    Any recommendation on how to mitigate both?
    Guillaume Klein
    Hi. Is it a specific set of sentences for which the translation should not change? If yes, you probably want to turn these sentences into translation memories.
    Jordi Mas
    No, this is a generic English -> Catalan translator. I find it challenging that every model translates differently. I guess that for quality I'm left with metrics like BLEU and human review to evaluate the new models. Thanks
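[Editor's note: run-to-run variance can be reduced, though not eliminated, by fixing the random seeds before training; some GPU reduction ops remain nondeterministic regardless, and OpenNMT-tf exposes a seed option on its command line if memory serves. A minimal illustration of why seeding makes pseudo-random draws reproducible:]

```python
import random

def seeded_draws(seed, n=3):
    """Two runs with the same seed produce identical pseudo-random
    sequences; this is the mechanism a training seed option relies on
    for data shuffling and weight initialization."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Same seed, same sequence: everything driven by this generator
# becomes repeatable across runs.
print(seeded_draws(42) == seeded_draws(42))  # True
print(seeded_draws(42) == seeded_draws(43))  # False
```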
    @guillaumekln I'll have a look. Thanks.
    Soumya Chennabasavaraj
    Hi, I am trying to train a translation model with a basic RNN, 2 layers each for the encoder and decoder. The training process is stuck at the first validation: it shows "loading dataset", the number of examples, and that's it. What is the solution to this? Thanks
    Guillaume Klein
    How big is the validation dataset?
    Soumya Chennabasavaraj
    The training set is 178822 examples.
    Each epoch takes around 4-5 seconds for training, but it gets stuck when it needs to do validation.
    Guillaume Klein
    Can you post the training logs? How long did you wait to conclude that the process is stuck?
    Soumya Chennabasavaraj
    command : onmt_train -data data/demo -save_model demo-model --src_word_vec_size 50 --tgt_word_vec_size 50 --report_every 1 --train_steps 100 --valid_steps 2 --enc_rnn_size 200 --dec_rnn_size 200
    Guillaume Klein
    Looks like you are in the wrong channel. Please post that in the OpenNMT-py channel or on the forum.
    Soumya Chennabasavaraj
    Okay. thanks
    Hi Guillaume, I have an old version of OpenNMT-tf installed; how do I upgrade to the latest 2.9? I ran 'pip install --upgrade pip' and 'pip install OpenNMT-tf' but it seems the upgrade does not happen. Thanks.
    Guillaume Klein
    Hi. Try with pip install --upgrade OpenNMT-tf
    OK, thanks !
    Hi Guillaume, I ran the basic command to train a customized Transformer model after upgrading to 2.9, but saw the following error:
    INFO:tensorflow:Accumulate gradients of 2 iterations to reach effective batch size of 25000
    Traceback (most recent call last):
    File "/conda/envs/simcloud/bin/onmt-main", line 8, in <module>
    File "/conda/envs/simcloud/lib/python3.6/site-packages/opennmt/bin/main.py", line 223, in main
    File "/conda/envs/simcloud/lib/python3.6/site-packages/opennmt/runner.py", line 205, in train
    devices = misc.get_devices(count=num_devices)
    File "/conda/envs/simcloud/lib/python3.6/site-packages/opennmt/utils/misc.py", line 34, in get_devices
    count, len(devices), "is" if len(devices) == 1 else "are"))
    ValueError: Requested 8 devices but only 1 is visible
    I guess it's a hardware-related incompatibility issue? My GPU is on CUDA 10.0.
    Guillaume Klein
    TensorFlow 2.1 requires CUDA 10.1.
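[Editor's note: the traceback shows where the error comes from (`opennmt/utils/misc.py`). Reconstructed as a sketch, the check explains why a CUDA mismatch surfaces this way: when TensorFlow cannot load the CUDA 10.1 libraries it expects, it registers only the CPU device, so a single device is "visible":]

```python
def get_devices(count, visible_devices):
    """Sketch of the check from opennmt/utils/misc.py in the traceback
    above: fail early when fewer devices are visible than requested."""
    if count > len(visible_devices):
        raise ValueError(
            "Requested %d devices but only %d %s visible"
            % (count, len(visible_devices),
               "is" if len(visible_devices) == 1 else "are"))
    return visible_devices[:count]

# With a CUDA version TensorFlow cannot load, only the CPU is registered:
try:
    get_devices(8, ["/device:CPU:0"])
except ValueError as e:
    print(e)  # Requested 8 devices but only 1 is visible
```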
    Michael A. Martin

    Is anyone else seeing a mecab install failure when installing OpenNMT-tf 2.9.1?

    (nlp_mt_testing) C:\Users\mm9q\PycharmProjects\nlp_mt_testing>pipenv update
    Running $ pipenv lock then $ pipenv sync.
    Locking [dev-packages] dependencies…
    Locking [packages] dependencies…
    Updated Pipfile.lock (c017cc)!
    Installing dependencies from Pipfile.lock (c017cc)…
    An error occurred while installing mecab-python3==0.996.5 --hash=sha256:0758c4e428c9eda01b14f2e93b4b48055264c4044c43992a8421bfd2a27d9ae0 [... additional --hash=sha256 values omitted ...]! Will try again.
    ================================ 120/120 - 00:01:13
    Installing initially failed dependencies…
    pipenv.exceptions.InstallError: File "c:\users\mm9q\appdata\local\programs\python\python37\lib\site-packages\pipenv\core.py", line 2611, in do_sync

    pipenv.exceptions.InstallError: File "c:\users\mm9q\appdata\local\programs\python\python37\lib\site-packages\pipenv\core.py", line 1253, in do_init

    pipenv.exceptions.InstallError: File "c:\users\mm9q\appdata\local\programs\python\python37\lib\site-packages\pipenv\core.py", line 859, in do_install_dependencies
    pipenv.exceptions.InstallError: retry_list, procs, failed_deps_queue, requirements_dir, **install_kwargs
    pipenv.exceptions.InstallError: File "c:\users\mm9q\appdata\local\programs\python\python37\lib\site-packages\pipenv\core.py", line 763, in batch_install
    pipenv.exceptions.InstallError: _cleanup_procs(procs, not blocking, failed_deps_queue, retry=retry)
    pipenv.exceptions.InstallError: File "c:\users\mm9q\appdata\local\programs\python\python37\lib\site-packages\pipenv\core.py", line 681, in _cleanup_procs
    pipenv.exceptions.InstallError: raise exceptions.InstallError(c.dep.name, extra=err_lines)

    Guillaume Klein
    Is it the complete error log?