    Sure Thanks
    Hi @guillaumekln, I am missing something here: after save_checkpoint_steps, I am getting this:
    INFO:tensorflow:Done running local_init_op.
    INFO:tensorflow:Evaluation predictions saved to ft_only_single_error_with_auto_stopping/eval/predictions.txt.141000
    INFO:tensorflow:BLEU evaluation score: 61.290000
    INFO:tensorflow:Finished evaluation at 2020-03-13-22:13:50
    INFO:tensorflow:Saving dict for global step 141000: global_step = 141000, loss = 0.98452455
    INFO:tensorflow:Saving 'checkpoint_path' summary for global step 141000: ft_only_single_error_with_auto_stopping/model.ckpt-141000
    INFO:tensorflow:Calling model_fn.
    Accuracy is nowhere to be found in the log, and this was my config:

    external_evaluators: bleu
    beam_width: 5
    metric: accuracy
    min_improvement: 0.2
    steps: 4

    batch_size: 64
    beam_width: 5

    Guillaume Klein
    All we discussed above is for OpenNMT-tf V2. So please update first.
    @guillaumekln ohh ok
    Hi @guillaumekln, I have updated OpenNMT-tf, but now at the evaluation step I am getting a weird error. Here is the log:
    Traceback (most recent call last):
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/bin/onmt-main", line 8, in <module>
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/bin/main.py", line 204, in main
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/runner.py", line 208, in train
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/training.py", line 104, in __call__
    early_stop = self._evaluate(evaluator, step, moving_average=moving_average)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/training.py", line 183, in _evaluate
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/evaluation.py", line 268, in __call__
    loss, predictions = self._eval_fn(source, target)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize
    args, kwds))
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2389, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2703, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2593, in _create_graph_function
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 978, in func_graph_from_py_func
    func_outputs = python_func(*func_args,
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 968, in wrapper
    raise e.ag_error_metadata.to_exception(e)
    tensorflow.python.framework.errors_impl.NotFoundError: in converted code:
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/models/model.py:137 evaluate  *
        outputs, predictions = self(features, labels=labels)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py:778 __call__
        outputs = call_fn(cast_inputs, *args, **kwargs)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/models/sequence_to_sequence.py:165 call  *
        predictions = self._dynamic_decode(
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/models/sequence_to_sequence.py:239 _dynamic_decode  *
        sampled_ids, sampled_length, log_probs, alignment, _ = self.decoder.dynamic_decode(
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/decoders/decoder.py:396 dynamic_decode  *
        return decoding.dynamic_deco
    And this error is not just for the accuracy metric; I get the same error with the BLEU score.
    Guillaume Klein
    Is that the complete error log? Looks like the end is missing.
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/decoders/decoder.py:396 dynamic_decode
    return decoding.dynamic_decode(
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/utils/decoding.py:491 dynamic_decode

    ids, attention, lengths = decoding_strategy._finalize( # pylint: disable=protected-access
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/utils/decoding.py:344 _finalize
    ids = tfa.seq2seq.gather_tree(step_ids, parent_ids, maximum_lengths, end_id)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_addons/seq2seq/beam_search_decoder.py:35 gather_tree

    return _beam_search_so.ops.addons_gather_tree(*args, **kwargs)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_addons/utils/resource_loader.py:49 ops
    self._ops = tf.load_op_library(get_path_to_datafile(self.relative_path))
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/framework/load_library.py:57 load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
    NotFoundError: /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_addons/custom_ops/seq2seq/_beam_search_ops.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs
    Here is the other part @guillaumekln
    Guillaume Klein
    What TensorFlow version do you have installed?
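An undefined symbol error when loading a tensorflow_addons custom op, as in the log above, often points to a tensorflow_addons build that does not match the installed TensorFlow, which is presumably why the versions matter here. Below is a minimal sketch for collecting them. It assumes Python 3.8+ (for importlib.metadata) and the usual PyPI distribution names; installed_versions is a hypothetical helper, not part of OpenNMT-tf.

```python
from importlib import metadata  # available in the standard library from Python 3.8


def installed_versions(
    packages=("tensorflow", "tensorflow-gpu", "tensorflow-addons", "OpenNMT-tf"),
):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

On older interpreters, such as the Python 3.6 environment in the traceback, `pip list` reports the same information.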
    I am looking for a working example of OpenNMT-tf and Keras. Any suggestions for a GitHub repo?
    Can anyone suggest an example of OpenNMT-tf with Keras?
    Also, what is the difference between OpenNMT-tf and https://github.com/lvapeab/nmt-keras?
    Guillaume Klein
    Do you mean an example using the Keras functional API? If yes, I'm not aware of such an example.
    I did not know about nmt-keras, but I would say the difference is that nmt-keras prioritizes Keras APIs while OpenNMT-tf prioritizes TensorFlow APIs.
    I'm wondering about the status of Horovod support in OpenNMT-tf (tag 1.25.3). Does it work out of the box in a multi-node environment? I've been trying to test distributed train_and_eval in a multi-node environment with 8 V100 GPUs per node, but without luck.
    Guillaume Klein
    It should work with OpenNMT-tf 1.x. Did you read this documentation? https://github.com/OpenNMT/OpenNMT-tf/blob/v1.25.3/docs/training.md#via-horovod-experimental

    I'm trying to run with 16 GPUs (2 nodes * 8 GPU) using the following, but without luck:

    nmarcel8@hpclogin1:/s/mlsc/nmarcel8/nmt/Portuguese/2020-03-23> cat exp.sh
    export PYTHONPATH=/s/mlsc/nmarcel8/git/mtda

    mpirun -np 16 \
    --hostfile ${PBS_NODEFILE} \
    -bind-to none -map-by slot \
    -mca btl_tcp_if_exclude lo,docker0 \
    -x PATH \
    -mca pml ob1 -mca btl ^openlib \
    python \
    $base_dir/opennmt/bin/main.py \
    train_and_eval \
    --model_type TransformerBigFP16 \
    --config exp.yml \
    --auto_config \

    Guillaume Klein
    Is there an error? Or what is the issue?
    Right now my job is in the queue waiting to launch. Quick question: do we need to specify --num_gpus as a parameter to bin/main.py?
    I'm thinking that if we specify the Horovod flag, it's not necessary.
    Guillaume Klein
    If I remember correctly you don't need to specify this option as Horovod will launch 16 separate processes.
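To illustrate why --num_gpus is not needed: with `mpirun -np 16`, each of the 16 processes is a separate worker, and a Horovod program typically pins one GPU per process using its local rank. The sketch below reads the rank information from the environment variables that Open MPI sets for each launched process. mpi_process_info is a hypothetical helper for illustration only, not OpenNMT-tf code; Horovod itself exposes the same values through hvd.rank(), hvd.local_rank(), and hvd.size().

```python
import os


def mpi_process_info(environ=None):
    """Read this process's rank information from Open MPI environment variables.

    Falls back to single-process defaults when not launched through mpirun.
    """
    if environ is None:
        environ = os.environ
    return {
        # Global index of this process across all nodes (0..size-1).
        "rank": int(environ.get("OMPI_COMM_WORLD_RANK", 0)),
        # Index of this process on its own node; usually used to pick the GPU.
        "local_rank": int(environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
        # Total number of processes started by mpirun.
        "size": int(environ.get("OMPI_COMM_WORLD_SIZE", 1)),
    }


if __name__ == "__main__":
    info = mpi_process_info()
    print("rank {rank}/{size}, would use local GPU {local_rank}".format(**info))
```

With 2 nodes of 8 GPUs each, every process would see size 16 and a local rank between 0 and 7.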
    Getting errors like this:
    I0326 15:42:23.700688 140253767198528 estimator.py:1147] Done calling model_fn.
    Traceback (most recent call last):
    File "/s/mlsc/nmarcel8/git/mtda/opennmt/bin/main.py", line 204, in <module>
    File "/s/mlsc/nmarcel8/git/mtda/opennmt/bin/main.py", line 175, in main
    File "/s/mlsc/nmarcel8/git/mtda/opennmt/runner.py", line 301, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
    return self.run_local()
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1426, in _train_with_estimator_spec
    'There should be a CheckpointSaverHook to use saving_listeners. '
    ValueError: There should be a CheckpointSaverHook to use saving_listeners. Please set one of the RunConfig.save_checkpoints_steps or RunConfig.save_checkpoints_secs.
    When I run on a single node (using just 8 GPUs) everything works fine, but not on multiple nodes.
    Guillaume Klein
    Can you try using train instead of train_and_eval?
    I submitted the job. Let me check on the status:

    nmarcel8@hpclogin1:/s/mlsc/nmarcel8/nmt/Portuguese/2020-03-23> fqstat

    Job ID        Job Name  App      CPUs  S  Elap Time  User
    6963772.hpcq  exp       HOROVOD  16    R  00:00:11   nmarcel8

    With train I get messages like the following
    hpca02r01n04:180:799 [4] NCCL INFO comm 0x7f4c3b1e2fb0 rank 4 nranks 16 cudaDev 4 nvmlDev 4 - Init COMPLETE
    hpca02r01n04:178:801 [2] NCCL INFO Trees [0] 6->2->0/-1/-1 [1] 1->2->3/-1/-1 [2] -1->2->1/10/-1 [3] 3->2->6/-1/-1 [4] 6->2->0/-1/-1 [5] 1->2->3/-1/-1 [6] 10->2->1/-1/-1 [7] 3->2->6/-1/-1
    hpca02r01n04:176:800 [0] NCCL INFO Ring 07 : 8 -> 0 [receive] via NET/IB/3/GDRDMA
    hpca02r01n04:182:802 [6] NCCL INFO comm 0x7f5e6f1e1cd0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 - Init COMPLETE
    hpca02r01n04:179:805 [3] NCCL INFO comm 0x7efb47136920 rank 3 nranks 16 cudaDev 3 nvmlDev 3 - Init COMPLETE
    hpca02r01n04:178:801 [2] NCCL INFO comm 0x7fb877137090 rank 2 nranks 16 cudaDev 2 nvmlDev 2 - Init COMPLETE
    hpca02r02n05:38:629 [0] NCCL INFO Ring 07 : 8 -> 0 [send] via NET/IB/3/GDRDMA
    hpca02r02n05:38:629 [0] NCCL INFO Trees [0] 10->8->11/-1/-1 [1] 12->8->9/-1/-1 [2] 9->8->12/-1/-1 [3] 0->8->11/-1/-1 [4] 10->8->11/-1/-1 [5] 12->8->9/-1/-1 [6] 9->8->12/-1/-1 [7] -1->8->11/0/-1
    hpca02r02n05:38:629 [0] NCCL INFO comm 0x7f6197879f50 rank 8 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
    hpca02r01n04:176:800 [0] NCCL INFO Trees [0] 2->0->3/-1/-1 [1] 4->0->1/-1/-1 [2] 1->0->4/-1/-1 [3] -1->0->3/8/-1 [4] 2->0->3/-1/-1 [5] 4->0->1/-1/-1 [6] 1->0->4/-1/-1 [7] 8->0->3/-1/-1
    hpca02r01n04:176:800 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled for all sizes
    hpca02r01n04:176:800 [0] NCCL INFO comm 0x7effb60eb430 rank 0 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
    hpca02r01n04:176:800 [0] NCCL INFO Launch mode Parallel
    I0326 16:02:05.270775 139641160255296 basic_session_run_hooks.py:262] loss = 9.732766, step = 0
    I0326 16:02:05.633767 140434656532288 basic_session_run_hooks.py:262] loss = 9.731406, step = 0
    I0326 16:02:05.639509 140400681133888 basic_session_run_hooks.py:262] loss = 9.727156, step = 0
    I0326 16:02:05.661324 140559522916160 basic_session_run_hooks.py:262] loss = 9.736551, step = 0
    I0326 16:02:05.671058 139726011828032 basic_session_run_hooks.py:262] loss = 9.732926, step = 0
    I0326 16:02:05.674440 140535427962688 basic_session_run_hooks.py:262] loss = 9.729025, step = 0
    I0326 16:02:05.675842 140709400389440 basic_session_run_hooks.py:262] loss = 9.741848, step = 0
    I0326 16:02:05.676417 139622124476224 basic_session_run_hooks.py:262] loss = 9.73002, step = 0
    I0326 16:02:05.681042 139969783359296 basic_session_run_hooks.py:262] loss = 9.730884, step = 0
    I0326 16:02:05.686120 139980528740160 basic_session_run_hooks.py:262] loss = 9.743285, step = 0
    I0326 16:02:05.689950 140085856864064 basic_session_run_hooks.py:262] loss = 9.726588, step = 0
    I0326 16:02:05.690726 140271360452416 basic_session_run_hooks.py:262] loss = 9.741635, step = 0
    I0326 16:02:05.692858 140061501163328 basic_session_run_hooks.py:262] loss = 9.737735, step = 0
    I0326 16:02:05.693819 140047958247232 basic_session_run_hooks.py:262] loss = 9.734624, step = 0
    I0326 16:02:05.695163 139628563687232 basic_session_run_hooks.py:262] loss = 9.73243, step = 0
    I0326 16:02:05.699985 139695624750912 basic_session_run_hooks.py:262] loss = 9.742135, step = 0
    W0326 16:02:17.888350 140400681133888 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
    W0326 16:02:17.890248 139641160255296 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
    W0326 16:02:17.891450 140434656532288 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
    W0326 16:02:17.899081 140559522916160 basic
    It is getting stuck and just keeps repeating the message "It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0."
    Guillaume Klein
    I don't recall this type of log in the context of Horovod. Note that I'm working to integrate Horovod in V2: OpenNMT/OpenNMT-tf#639. Maybe you could try this one if it is possible for you to update.
    Jordi Mas
    Every time I train a model on the same dataset I get a different model, since neural network training is not deterministic. Even when I compare two models with very similar BLEU scores, they translate very differently. Each new model improves on some sentences but does worse on others. This causes two problems:
    a) It is hard to build quality assessment on previous results, since every model translates differently.
    b) It can be confusing to users, since a new model translates differently from the previous one.
    I guess that I can play with the validation corpus and retraining.
    Any recommendations on how to mitigate both?
    Guillaume Klein
    Hi. Is it a specific set of sentences for which the translation should not change? If yes, you probably want to turn these sentences into translation memories.
    Jordi Mas
    No, this is a generic English -> Catalan translator. I find it challenging that every model translates differently. I guess that for quality I'm left with metrics like BLEU and human review to evaluate the new models. Thanks.
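One way to make the differences between two models visible, alongside BLEU and human review, is to translate a fixed set of sentences with both models and list which outputs changed. The sketch below is only an illustration under that assumption; compare_outputs is a hypothetical helper, and both inputs are assumed to be line-aligned translations of the same source sentences.

```python
def compare_outputs(old_lines, new_lines):
    """Compare two line-aligned lists of translations.

    Returns (changed_indices, changed_ratio), ignoring leading and trailing
    whitespace differences.
    """
    if len(old_lines) != len(new_lines):
        raise ValueError("Both outputs must contain the same number of lines")
    changed = [
        index
        for index, (old, new) in enumerate(zip(old_lines, new_lines))
        if old.strip() != new.strip()
    ]
    ratio = len(changed) / len(old_lines) if old_lines else 0.0
    return changed, ratio


if __name__ == "__main__":
    old = ["Bon dia", "Com estàs?", "Gràcies"]
    new = ["Bon dia", "Com va?", "Gràcies"]
    changed, ratio = compare_outputs(old, new)
    print(f"{len(changed)} of {len(old)} sentences changed ({ratio:.0%}): {changed}")
```

The changed indices give reviewers a short list of sentences to check by hand instead of the whole test set.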
    @guillaumekln I'll have a look. Thanks.
    Soumya Chennabasavaraj
    Hi, I am trying to train a translation model with a basic 2-layer RNN encoder and decoder. The training process gets stuck at the first validation. It shows "loading dataset", the number of examples, and that's it. What is the solution to this? Thanks.
    Guillaume Klein
    How big is the validation dataset?
    Soumya Chennabasavaraj
    The training set is 178822.
    Each epoch takes around 4-5 seconds to train, but it gets stuck when it needs to run validation.
    Guillaume Klein
    Can you post the training logs? How long did you wait to conclude that the process is stuck?
    Soumya Chennabasavaraj
    Command: onmt_train -data data/demo -save_model demo-model --src_word_vec_size 50 --tgt_word_vec_size 50 --report_every 1 --train_steps 100 --valid_steps 2 --enc_rnn_size 200 --dec_rnn_size 200
    Guillaume Klein
    Looks like you are in the wrong channel. Please post that in the OpenNMT-py channel or on the forum.
    Soumya Chennabasavaraj
    Okay. thanks