    VishalKakkar
    @VishalKakkar

    eval:
      eval_delay: 360  # Every 1 hour
      external_evaluators: accuracy
      beam_width: 5
      early_stopping:
        metric: accuracy
        min_improvement: 0.2
        steps: 4

    infer:
      batch_size: 64
      beam_width: 5

    Guillaume Klein
    @guillaumekln
    In this case, accuracy is not an external evaluator as it is attached to the model. You should remove this line.
    Are you using a recent version of OpenNMT-tf? eval_delay no longer exists, see https://opennmt.net/OpenNMT-tf/v2_transition.html
    VishalKakkar
    @VishalKakkar
    I am using version 1.25.0, as it is the same version as the prod boxes. Tell me one thing: if it is not an external evaluation, how will I track that early stopping is happening on accuracy, or whether it is happening at all?
    Guillaume Klein
    @guillaumekln
    First, please note that early stopping was added in version 2.0.0, so you should first update. Early stopping will work on accuracy if you set metric: accuracy in the configuration (as you did).
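    A minimal upgrade sketch, assuming a pip-based install (pin an exact 2.x release instead if your prod environment requires it):

    pip install --upgrade OpenNMT-tf                           # upgrades to the latest release
    python -c "import opennmt; print(opennmt.__version__)"     # confirm the installed version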
    VishalKakkar
    @VishalKakkar
    Ok, got it, let me then try with a newer version of OpenNMT. Thanks. @guillaumekln, is there any way we can also set it as an external evaluator, to track accuracy after every save_checkpoints_steps?
    Guillaume Klein
    @guillaumekln
    Evaluation reports the loss, the metrics declared by the model, and the scores returned by external evaluators. As you declared accuracy as a model metric, it will be automatically reported during evaluation.
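    One convenient way to follow these reported values, assuming the default setup where summaries are written under the model directory, is TensorBoard (replace <model_dir> with your run directory):

    tensorboard --logdir <model_dir>   # evaluation summaries typically appear under <model_dir>/eval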
    VishalKakkar
    @VishalKakkar
    Sure Thanks
    VishalKakkar
    @VishalKakkar
    Hi @guillaumekln, I am missing something here; after save_checkpoints_steps I am getting this:
    INFO:tensorflow:Done running local_init_op.
    INFO:tensorflow:Evaluation predictions saved to ft_only_single_error_with_auto_stopping/eval/predictions.txt.141000
    INFO:tensorflow:BLEU evaluation score: 61.290000
    INFO:tensorflow:Finished evaluation at 2020-03-13-22:13:50
    INFO:tensorflow:Saving dict for global step 141000: global_step = 141000, loss = 0.98452455
    INFO:tensorflow:Saving 'checkpoint_path' summary for global step 141000: ft_only_single_error_with_auto_stopping/model.ckpt-141000
    INFO:tensorflow:Calling model_fn.
    accuracy is nowhere, and this was my config:

    eval:
      external_evaluators: bleu
      beam_width: 5
      early_stopping:
        metric: accuracy
        min_improvement: 0.2
        steps: 4

    infer:
      batch_size: 64
      beam_width: 5

    Guillaume Klein
    @guillaumekln
    Everything we discussed above is for OpenNMT-tf V2, so please update first.
    VishalKakkar
    @VishalKakkar
    @guillaumekln ohh ok
    VishalKakkar
    @VishalKakkar
    Hi @guillaumekln, I have updated the version of OpenNMT, but now at the evaluation step I am getting a weird error. Here is the log:
    Traceback (most recent call last):
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/bin/onmt-main", line 8, in <module>
    sys.exit(main())
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/bin/main.py", line 204, in main
    checkpoint_path=args.checkpoint_path)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/runner.py", line 208, in train
    moving_average_decay=train_config.get("moving_average_decay"))
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/training.py", line 104, in call
    early_stop = self._evaluate(evaluator, step, moving_average=moving_average)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/training.py", line 183, in _evaluate
    evaluator(step)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/evaluation.py", line 268, in call
    loss, predictions = self._eval_fn(source, target)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in call
    result = self._call(*args, **kwds)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize
    args, kwds))
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2389, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2703, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2593, in _create_graph_function
    capture_by_value=self._capture_by_value),
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 978, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 968, in wrapper
    raise e.ag_error_metadata.to_exception(e)
    tensorflow.python.framework.errors_impl.NotFoundError: in converted code:
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/models/model.py:137 evaluate  *
        outputs, predictions = self(features, labels=labels)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py:778 __call__
        outputs = call_fn(cast_inputs, *args, **kwargs)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/models/sequence_to_sequence.py:165 call  *
        predictions = self._dynamic_decode(
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/models/sequence_to_sequence.py:239 _dynamic_decode  *
        sampled_ids, sampled_length, log_probs, alignment, _ = self.decoder.dynamic_decode(
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/decoders/decoder.py:396 dynamic_decode  *
        return decoding.dynamic_deco
    VishalKakkar
    @VishalKakkar
    and this error is not just for the accuracy metric; I am getting the same error even for the BLEU score.
    Guillaume Klein
    @guillaumekln
    Is that the complete error log? Looks like the end is missing.
    VishalKakkar
    @VishalKakkar
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/decoders/decoder.py:396 dynamic_decode
    return decoding.dynamic_decode(
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/utils/decoding.py:491 dynamic_decode
        ids, attention, lengths = decoding_strategy._finalize( # pylint: disable=protected-access
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/utils/decoding.py:344 _finalize
        ids = tfa.seq2seq.gather_tree(step_ids, parent_ids, maximum_lengths, end_id)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_addons/seq2seq/beam_search_decoder.py:35 gather_tree
        return _beam_search_so.ops.addons_gather_tree(*args, **kwargs)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_addons/utils/resource_loader.py:49 ops
    self._ops = tf.load_op_library(get_path_to_datafile(self.relative_path))
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/framework/load_library.py:57 load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
    NotFoundError: /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_addons/custom_ops/seq2seq/_beam_search_ops.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs
    Here is the other part @guillaumekln
    Guillaume Klein
    @guillaumekln
    What TensorFlow version do you have installed?
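    This kind of undefined symbol in _beam_search_ops.so usually points to a binary mismatch between tensorflow and tensorflow-addons. A quick way to check the installed versions, assuming a pip-based environment:

    python -c "import tensorflow as tf; print(tf.__version__)"
    python -c "import tensorflow_addons as tfa; print(tfa.__version__)"
    pip show OpenNMT-tf tensorflow tensorflow-addons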
    nashid
    @nashid
    I am looking for a working example of OpenNMT-tf and Keras. Any suggestions for a GitHub repo?
    nashid
    @nashid
    Can anyone suggest an example of OpenNMT-tf with Keras?
    Also, what is the difference between OpenNMT-tf and https://github.com/lvapeab/nmt-keras?
    Guillaume Klein
    @guillaumekln
    Do you mean an example using the Keras functional API? If yes, I'm not aware of such an example.
    I did not know about nmt-keras, but I would say the difference is that nmt-keras prioritizes Keras APIs while OpenNMT-tf prioritizes TensorFlow APIs.
    nelsonmarcelino
    @nelsonmarcelino
    Wondering about the status of Horovod use with OpenNMT-tf (tag 1.25.3). Does it work out of the box in a multi-node environment? I've been trying to test distributed train_and_eval in a multi-node environment with 8 V100 GPUs per node, but without luck.
    Guillaume Klein
    @guillaumekln
    It should work with OpenNMT-tf 1.x. Did you read this documentation? https://github.com/OpenNMT/OpenNMT-tf/blob/v1.25.3/docs/training.md#via-horovod-experimental
    nelsonmarcelino
    @nelsonmarcelino

    I'm trying to run with 16 GPUs (2 nodes x 8 GPUs) using the following, but without luck:

    nmarcel8@hpclogin1:/s/mlsc/nmarcel8/nmt/Portuguese/2020-03-23> cat exp.sh
    export PYTHONPATH=/s/mlsc/nmarcel8/git/mtda
    base_dir=/s/mlsc/nmarcel8/git/mtda

    mpirun -np 16 \
    --hostfile ${PBS_NODEFILE} \
    -bind-to none -map-by slot \
    -mca btl_tcp_if_exclude lo,docker0 \
    -x NCCL_DEBUG=INFO \
    -x LD_LIBRARY_PATH \
    -x PYTHONPATH \
    -x PATH \
    -mca pml ob1 -mca btl ^openlib \
    python \
    $base_dir/opennmt/bin/main.py \
    train_and_eval \
    --model_type TransformerBigFP16 \
    --config exp.yml \
    --auto_config \
    --horovod

    Guillaume Klein
    @guillaumekln
    Is there an error? Or what is the issue?
    nelsonmarcelino
    @nelsonmarcelino
    Right now my job is in the queue waiting to launch. Quick question: do we need to specify --num_gpus as a parameter to bin/main.py?
    I'm thinking that if we specified the horovod flag, it's not necessary.
    Guillaume Klein
    @guillaumekln
    If I remember correctly, you don't need to specify this option, as Horovod will launch 16 separate processes.
    nelsonmarcelino
    @nelsonmarcelino
    Getting errors like this:
    I0326 15:42:23.700688 140253767198528 estimator.py:1147] Done calling model_fn.
    Traceback (most recent call last):
    File "/s/mlsc/nmarcel8/git/mtda/opennmt/bin/main.py", line 204, in <module>
    main()
    File "/s/mlsc/nmarcel8/git/mtda/opennmt/bin/main.py", line 175, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
    File "/s/mlsc/nmarcel8/git/mtda/opennmt/runner.py", line 301, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
    return self.run_local()
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
    saving_listeners=saving_listeners)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
    saving_listeners)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1426, in _train_with_estimator_spec
    'There should be a CheckpointSaverHook to use saving_listeners. '
    ValueError: There should be a CheckpointSaverHook to use saving_listeners. Please set one of the RunConfig.save_checkpoints_steps or RunConfig.save_checkpoints_secs.
    When I run on a single node (using just 8 GPUs) everything works fine, but not on multi-node.
    Guillaume Klein
    @guillaumekln
    Can you try using train instead of train_and_eval?
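    For example, a sketch of the adjusted launch script, keeping the mpirun options above and only changing the run type (still without --num_gpus, as discussed):

    mpirun -np 16 \
        --hostfile ${PBS_NODEFILE} \
        <same MPI options as above> \
        python $base_dir/opennmt/bin/main.py \
        train \
        --model_type TransformerBigFP16 \
        --config exp.yml \
        --auto_config \
        --horovod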
    nelsonmarcelino
    @nelsonmarcelino
    I submitted the job. Let me check on the status:

    nmarcel8@hpclogin1:/s/mlsc/nmarcel8/nmt/Portuguese/2020-03-23> fqstat
    Job ID        Job Name  App      CPUs  S  Elap Time  User
    6963772.hpcq  exp       HOROVOD  16    R  00:00:11   nmarcel8
    nelsonmarcelino
    @nelsonmarcelino
    With train I get messages like the following:
    ...
    hpca02r01n04:180:799 [4] NCCL INFO comm 0x7f4c3b1e2fb0 rank 4 nranks 16 cudaDev 4 nvmlDev 4 - Init COMPLETE
    hpca02r01n04:178:801 [2] NCCL INFO Trees [0] 6->2->0/-1/-1 [1] 1->2->3/-1/-1 [2] -1->2->1/10/-1 [3] 3->2->6/-1/-1 [4] 6->2->0/-1/-1 [5] 1->2->3/-1/-1 [6] 10->2->1/-1/-1 [7] 3->2->6/-1/-1
    hpca02r01n04:176:800 [0] NCCL INFO Ring 07 : 8 -> 0 [receive] via NET/IB/3/GDRDMA
    hpca02r01n04:182:802 [6] NCCL INFO comm 0x7f5e6f1e1cd0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 - Init COMPLETE
    hpca02r01n04:179:805 [3] NCCL INFO comm 0x7efb47136920 rank 3 nranks 16 cudaDev 3 nvmlDev 3 - Init COMPLETE
    hpca02r01n04:178:801 [2] NCCL INFO comm 0x7fb877137090 rank 2 nranks 16 cudaDev 2 nvmlDev 2 - Init COMPLETE
    hpca02r02n05:38:629 [0] NCCL INFO Ring 07 : 8 -> 0 [send] via NET/IB/3/GDRDMA
    hpca02r02n05:38:629 [0] NCCL INFO Trees [0] 10->8->11/-1/-1 [1] 12->8->9/-1/-1 [2] 9->8->12/-1/-1 [3] 0->8->11/-1/-1 [4] 10->8->11/-1/-1 [5] 12->8->9/-1/-1 [6] 9->8->12/-1/-1 [7] -1->8->11/0/-1
    hpca02r02n05:38:629 [0] NCCL INFO comm 0x7f6197879f50 rank 8 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
    hpca02r01n04:176:800 [0] NCCL INFO Trees [0] 2->0->3/-1/-1 [1] 4->0->1/-1/-1 [2] 1->0->4/-1/-1 [3] -1->0->3/8/-1 [4] 2->0->3/-1/-1 [5] 4->0->1/-1/-1 [6] 1->0->4/-1/-1 [7] 8->0->3/-1/-1
    hpca02r01n04:176:800 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled for all sizes
    hpca02r01n04:176:800 [0] NCCL INFO comm 0x7effb60eb430 rank 0 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
    hpca02r01n04:176:800 [0] NCCL INFO Launch mode Parallel
    I0326 16:02:05.270775 139641160255296 basic_session_run_hooks.py:262] loss = 9.732766, step = 0
    I0326 16:02:05.633767 140434656532288 basic_session_run_hooks.py:262] loss = 9.731406, step = 0
    I0326 16:02:05.639509 140400681133888 basic_session_run_hooks.py:262] loss = 9.727156, step = 0
    I0326 16:02:05.661324 140559522916160 basic_session_run_hooks.py:262] loss = 9.736551, step = 0
    I0326 16:02:05.671058 139726011828032 basic_session_run_hooks.py:262] loss = 9.732926, step = 0
    I0326 16:02:05.674440 140535427962688 basic_session_run_hooks.py:262] loss = 9.729025, step = 0
    I0326 16:02:05.675842 140709400389440 basic_session_run_hooks.py:262] loss = 9.741848, step = 0
    I0326 16:02:05.676417 139622124476224 basic_session_run_hooks.py:262] loss = 9.73002, step = 0
    I0326 16:02:05.681042 139969783359296 basic_session_run_hooks.py:262] loss = 9.730884, step = 0
    I0326 16:02:05.686120 139980528740160 basic_session_run_hooks.py:262] loss = 9.743285, step = 0
    I0326 16:02:05.689950 140085856864064 basic_session_run_hooks.py:262] loss = 9.726588, step = 0
    I0326 16:02:05.690726 140271360452416 basic_session_run_hooks.py:262] loss = 9.741635, step = 0
    I0326 16:02:05.692858 140061501163328 basic_session_run_hooks.py:262] loss = 9.737735, step = 0
    I0326 16:02:05.693819 140047958247232 basic_session_run_hooks.py:262] loss = 9.734624, step = 0
    I0326 16:02:05.695163 139628563687232 basic_session_run_hooks.py:262] loss = 9.73243, step = 0
    I0326 16:02:05.699985 139695624750912 basic_session_run_hooks.py:262] loss = 9.742135, step = 0
    W0326 16:02:17.888350 140400681133888 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
    W0326 16:02:17.890248 139641160255296 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
    W0326 16:02:17.891450 140434656532288 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
    W0326 16:02:17.899081 140559522916160 basic
    ...
    nelsonmarcelino
    @nelsonmarcelino
    It is getting stuck and just repeating the message "It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0."
    Guillaume Klein
    @guillaumekln
    I don't recall this type of log in the context of Horovod. Note that I'm working to integrate Horovod in V2: OpenNMT/OpenNMT-tf#639. Maybe you could try that if it is possible for you to update.
    Jordi Mas
    @jordimas
    Hi
    Every time I train a model with the same data set I get a different model, since the training of neural networks is not deterministic. Even if I compare two models with very similar BLEU scores, they translate in very different ways. A new model always improves on some sentences but does worse on others. This makes two things really difficult:
    a) Building quality assessment based on previous results, since the models translate differently every time.
    b) Avoiding confusion for users, since a new model translates differently from the previous one.
    I guess that I can play with the validation corpus and re-training.
    Any recommendation on how to mitigate both?
    Guillaume Klein
    @guillaumekln
    Hi. Is it a specific set of sentences for which the translation should not change? If yes, you probably want to turn these sentences into translation memories.
    Jordi Mas
    @jordimas
    No, this is a generic English -> Catalan translator. I find it challenging that every model translates differently. I guess that for quality I'm left with metrics like BLEU and human review to evaluate the new models. Thanks.
    nelsonmarcelino
    @nelsonmarcelino
    @guillaumekln I'll have a look. Thanks.
    Soumya Chennabasavaraj
    @soumyacbr
    Hi, I am trying to train a translation model with a basic RNN, 2 layers for the encoder and decoder. The training process is stuck at the first validation: it shows "loading dataset", the number of examples, and that's it. What is the solution to this? Thanks.
    Guillaume Klein
    @guillaumekln
    How big is the validation dataset?
    Soumya Chennabasavaraj
    @soumyacbr
    32867 examples.
    The training set is 178822.