    Guillaume Klein
    @guillaumekln
    I'm not sure I understand. Can you describe in more detail what you want to do?
    VishalKakkar
    @VishalKakkar
    Basically I want to check accuracy at each checkpoint saving step instead of the BLEU score, and if it is not improving for a certain number of steps I want to stop training.
    Guillaume Klein
    @guillaumekln
    Accuracy is not implemented for seq2seq models at the moment. But you could prepare a custom model definition that overrides the get_metrics and update_metrics methods. Then you could select your newly defined metric in the early stopping config.
    VishalKakkar
    @VishalKakkar
    Can you provide a reference to an example, if possible?
    Guillaume Klein
    @guillaumekln
    The SequenceClassifier model is defining an accuracy metric, maybe that could help: https://github.com/OpenNMT/OpenNMT-tf/blob/master/opennmt/models/sequence_classifier.py#L68-L72
    VishalKakkar
    @VishalKakkar
    Thanks @guillaumekln
    VishalKakkar
    @VishalKakkar

    Hi @guillaumekln, I tried the above method but I am getting this error: "No scorer associated with the name: accuracy". Here is the code of my custom model:

    import opennmt
    import tensorflow as tf

    class MyCustomTransformer(opennmt.models.Transformer):
        def __init__(self):
            super().__init__(
                source_inputter=opennmt.inputters.WordEmbedder(embedding_size=128, vocabulary_file_key="source_words_vocabulary"),
                target_inputter=opennmt.inputters.WordEmbedder(embedding_size=128, vocabulary_file_key="target_words_vocabulary"),
                num_layers=1,
                num_units=128,
                num_heads=8,
                ffn_inner_dim=512)

        # Here you can override any method from the Model class for a customized behavior.

        def get_metrics(self):
            return {"accuracy": tf.keras.metrics.Accuracy()}

        def update_metrics(self, metrics, predictions, labels):
            metrics["accuracy"].update_state(labels, predictions)

    model = MyCustomTransformer
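    (Side note, as an illustrative sketch only: a custom definition like this is typically passed to onmt-main through the --model option; the file and config names below are assumptions, and the exact invocation differs between OpenNMT-tf 1.x and 2.x — this is the 2.x form.)

    onmt-main --model custom_model.py --config my_config.yml --auto_config train --with_eval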

    Guillaume Klein
    @guillaumekln
    What is your YAML configuration?
    VishalKakkar
    @VishalKakkar
    Here it is:

    eval:
      eval_delay: 360  # Every 1 hour
      external_evaluators: accuracy
      beam_width: 5
      early_stopping:
        metric: accuracy
        min_improvement: 0.2
        steps: 4

    infer:
      batch_size: 64
      beam_width: 5

    Guillaume Klein
    @guillaumekln
    In this case, accuracy is not an external evaluator as it is attached to the model. You should remove this line.
    Are you using a recent version of OpenNMT-tf? eval_delay no longer exists, see https://opennmt.net/OpenNMT-tf/v2_transition.html
    VishalKakkar
    @VishalKakkar
    I am using version 1.25.0 as it is the same version as on the prod boxes. Tell me one thing: if it is not an external evaluator, then how will I track that early stopping is based on accuracy, or whether it is happening at all?
    Guillaume Klein
    @guillaumekln
    First, please note that early stopping was added in version 2.0.0 so you should first update. Early stopping will work on accuracy if you set metric: accuracy in the configuration (as you did).
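    (For reference, a sketch of the eval block after applying this advice; the values are kept from the earlier config, and the v2 replacement for eval_delay is left to the transition guide linked above:)

    eval:
      external_evaluators: bleu        # optional external scorer; accuracy is a model metric, so it is not listed here
      beam_width: 5
      early_stopping:
        metric: accuracy               # early stopping tracks the metric declared by the custom model
        min_improvement: 0.2
        steps: 4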
    VishalKakkar
    @VishalKakkar
    OK, got it, let me then try with the newer version of OpenNMT. Thanks. @guillaumekln, is there also any way we can set it as an external evaluator to track accuracy after every save_checkpoint_step?
    Guillaume Klein
    @guillaumekln
    Evaluation reports the loss, the metrics declared by the model, and the scores returned by external evaluators. As you declared the accuracy as a model metric, it will be automatically reported during evaluation.
    VishalKakkar
    @VishalKakkar
    Sure Thanks
    VishalKakkar
    @VishalKakkar
    Hi @guillaumekln, I am missing something here; after save_checkpoint_steps I am getting this:
    INFO:tensorflow:Done running local_init_op.
    INFO:tensorflow:Evaluation predictions saved to ft_only_single_error_with_auto_stopping/eval/predictions.txt.141000
    INFO:tensorflow:BLEU evaluation score: 61.290000
    INFO:tensorflow:Finished evaluation at 2020-03-13-22:13:50
    INFO:tensorflow:Saving dict for global step 141000: global_step = 141000, loss = 0.98452455
    INFO:tensorflow:Saving 'checkpoint_path' summary for global step 141000: ft_only_single_error_with_auto_stopping/model.ckpt-141000
    INFO:tensorflow:Calling model_fn.
    Accuracy is nowhere in the output, and this was my config:

    eval:
      external_evaluators: bleu
      beam_width: 5
      early_stopping:
        metric: accuracy
        min_improvement: 0.2
        steps: 4

    infer:
      batch_size: 64
      beam_width: 5

    Guillaume Klein
    @guillaumekln
    All we discussed above is for OpenNMT-tf V2. So please update first.
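    (As an aside, a minimal sketch of such an upgrade, assuming a pip-based install:)

    pip install --upgrade OpenNMT-tf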
    VishalKakkar
    @VishalKakkar
    @guillaumekln ohh ok
    VishalKakkar
    @VishalKakkar
    Hi @guillaumekln, I have updated the version of OpenNMT, but now at the evaluation step I am getting a weird error. Here is the log:
    Traceback (most recent call last):
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/bin/onmt-main", line 8, in <module>
    sys.exit(main())
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/bin/main.py", line 204, in main
    checkpoint_path=args.checkpoint_path)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/runner.py", line 208, in train
    moving_average_decay=train_config.get("moving_average_decay"))
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/training.py", line 104, in call
    early_stop = self._evaluate(evaluator, step, moving_average=moving_average)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/training.py", line 183, in _evaluate
    evaluator(step)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/evaluation.py", line 268, in call
    loss, predictions = self._eval_fn(source, target)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in call
    result = self._call(args, **kwds)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize
    *args, **kwds))
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2389, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2703, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2593, in _create_graph_function
    capture_by_value=self._capture_by_value),
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 978, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
    File "/Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 968, in wrapper
    raise e.ag_error_metadata.to_exception(e)
    tensorflow.python.framework.errors_impl.NotFoundError: in converted code:
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/models/model.py:137 evaluate  *
        outputs, predictions = self(features, labels=labels)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py:778 __call__
        outputs = call_fn(cast_inputs, *args, **kwargs)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/models/sequence_to_sequence.py:165 call  *
        predictions = self._dynamic_decode(
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/models/sequence_to_sequence.py:239 _dynamic_decode  *
        sampled_ids, sampled_length, log_probs, alignment, _ = self.decoder.dynamic_decode(
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/decoders/decoder.py:396 dynamic_decode  *
        return decoding.dynamic_deco
    VishalKakkar
    @VishalKakkar
    And this error is not just for the accuracy metric; I am getting the same error even for the BLEU score.
    Guillaume Klein
    @guillaumekln
    Is that the complete error log? Looks like the end is missing.
    VishalKakkar
    @VishalKakkar
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/decoders/decoder.py:396 dynamic_decode
    return decoding.dynamic_decode(
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/utils/decoding.py:491 dynamic_decode
        ids, attention, lengths = decoding_strategy._finalize(  # pylint: disable=protected-access
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/opennmt/utils/decoding.py:344 _finalize
        ids = tfa.seq2seq.gather_tree(step_ids, parent_ids, maximum_lengths, end_id)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_addons/seq2seq/beam_search_decoder.py:35 gather_tree
        return _beam_search_so.ops.addons_gather_tree(*args, **kwargs)
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_addons/utils/resource_loader.py:49 ops
    self._ops = tf.load_op_library(get_path_to_datafile(self.relative_path))
    /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_core/python/framework/load_library.py:57 load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
    NotFoundError: /Vernacular/vishal.Linux_debian_9_3.py36.latest_onmt/lib/python3.6/site-packages/tensorflow_addons/custom_ops/seq2seq/_beam_search_ops.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs
    Here is the other part @guillaumekln
    Guillaume Klein
    @guillaumekln
    What TensorFlow version do you have installed?
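    (Side note: an undefined-symbol error from a tensorflow_addons custom op usually indicates a binary mismatch between the installed TensorFlow and tensorflow-addons. A quick way to check the installed versions, as a sketch:)

    python -c "import tensorflow as tf; print(tf.__version__)"
    python -c "import tensorflow_addons as tfa; print(tfa.__version__)"
    pip show tensorflow tensorflow-addons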
    nashid
    @nashid
    I am looking for a working example of OpenNMT-tf and Keras. Any suggestion for a GitHub repo?
    nashid
    @nashid
    Can anyone suggest an example for OpenNMT-tf with Keras?
    Also, what is the difference between OpenNMT-tf and https://github.com/lvapeab/nmt-keras?
    Guillaume Klein
    @guillaumekln
    Do you mean an example using the Keras functional API? If yes, I'm not aware of such an example.
    I did not know about nmt-keras, but I would say the difference is that nmt-keras prioritizes Keras APIs while OpenNMT-tf prioritizes TensorFlow APIs.
    nelsonmarcelino
    @nelsonmarcelino
    Wondering about the status of Horovod use with OpenNMT-tf (tag 1.25.3). Does it work out of the box in a multi-node environment? I've been trying to test distributed train_and_eval in a multi-node environment with 8 V100 GPUs per node, but without luck.
    Guillaume Klein
    @guillaumekln
    It should work with OpenNMT-tf 1.x. Did you read this documentation? https://github.com/OpenNMT/OpenNMT-tf/blob/v1.25.3/docs/training.md#via-horovod-experimental
    nelsonmarcelino
    @nelsonmarcelino

    I'm trying to run with 16 GPUs (2 nodes * 8 GPU) using the following, but without luck:

    nmarcel8@hpclogin1:/s/mlsc/nmarcel8/nmt/Portuguese/2020-03-23> cat exp.sh
    export PYTHONPATH=/s/mlsc/nmarcel8/git/mtda
    base_dir=/s/mlsc/nmarcel8/git/mtda

    mpirun -np 16 \
    --hostfile ${PBS_NODEFILE} \
    -bind-to none -map-by slot \
    -mca btl_tcp_if_exclude lo,docker0 \
    -x NCCL_DEBUG=INFO \
    -x LD_LIBRARY_PATH \
    -x PYTHONPATH \
    -x PATH \
    -mca pml ob1 -mca btl ^openlib \
    python \
    $base_dir/opennmt/bin/main.py \
    train_and_eval \
    --model_type TransformerBigFP16 \
    --config exp.yml \
    --auto_config \
    --horovod

    Guillaume Klein
    @guillaumekln
    Is there an error? Or what is the issue?
    nelsonmarcelino
    @nelsonmarcelino
    Right now my job is in the queue waiting to launch. Quick question: do we need to specify --num_gpus as a parameter to bin/main.py?
    I'm thinking that if we specified the --horovod flag, it's not necessary.
    Guillaume Klein
    @guillaumekln
    If I remember correctly you don't need to specify this option as Horovod will launch 16 separate processes.
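    (As an untested aside: Horovod also ships its own launcher, so an equivalent launch could look roughly like the sketch below; the hostnames are assumptions and the mpirun form from the thread above is what the OpenNMT-tf docs show.)

    horovodrun -np 16 -H node1:8,node2:8 \
        python $base_dir/opennmt/bin/main.py train_and_eval \
        --model_type TransformerBigFP16 --config exp.yml --auto_config --horovod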
    nelsonmarcelino
    @nelsonmarcelino
    Getting errors like this:
    I0326 15:42:23.700688 140253767198528 estimator.py:1147] Done calling model_fn.
    Traceback (most recent call last):
    File "/s/mlsc/nmarcel8/git/mtda/opennmt/bin/main.py", line 204, in <module>
    main()
    File "/s/mlsc/nmarcel8/git/mtda/opennmt/bin/main.py", line 175, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
    File "/s/mlsc/nmarcel8/git/mtda/opennmt/runner.py", line 301, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
    return self.run_local()
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
    saving_listeners=saving_listeners)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
    saving_listeners)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1426, in _train_with_estimator_spec
    'There should be a CheckpointSaverHook to use saving_listeners. '
    ValueError: There should be a CheckpointSaverHook to use saving_listeners. Please set one of the RunConfig.save_checkpoints_steps or RunConfig.save_checkpoints_secs.
    When I run on a single node (using just 8 GPUs) everything works fine, but not multi-node.
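    (For what it's worth, that ValueError asks for a checkpoint saving interval; in the OpenNMT-tf YAML this is the train section's save_checkpoints_steps key. A sketch with an assumed value:)

    train:
      save_checkpoints_steps: 5000   # assumed value; the point is that some interval must be set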
    Guillaume Klein
    @guillaumekln
    Can you try using train instead of train_and_eval?
    nelsonmarcelino
    @nelsonmarcelino
    I submitted the job. Let me check on the status:

    nmarcel8@hpclogin1:/s/mlsc/nmarcel8/nmt/Portuguese/2020-03-23> fqstat

    Job ID        Job Name  App      CPUs  S  Elap Time  User
    6963772.hpcq  exp       HOROVOD  16    R  00:00:11   nmarcel8

    nelsonmarcelino
    @nelsonmarcelino
    With train I get messages like the following
    ...
    hpca02r01n04:180:799 [4] NCCL INFO comm 0x7f4c3b1e2fb0 rank 4 nranks 16 cudaDev 4 nvmlDev 4 - Init COMPLETE
    hpca02r01n04:178:801 [2] NCCL INFO Trees [0] 6->2->0/-1/-1 [1] 1->2->3/-1/-1 [2] -1->2->1/10/-1 [3] 3->2->6/-1/-1 [4] 6->2->0/-1/-1 [5] 1->2->3/-1/-1 [6] 10->2->1/-1/-1 [7] 3->2->6/-1/-1
    hpca02r01n04:176:800 [0] NCCL INFO Ring 07 : 8 -> 0 [receive] via NET/IB/3/GDRDMA
    hpca02r01n04:182:802 [6] NCCL INFO comm 0x7f5e6f1e1cd0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 - Init COMPLETE
    hpca02r01n04:179:805 [3] NCCL INFO comm 0x7efb47136920 rank 3 nranks 16 cudaDev 3 nvmlDev 3 - Init COMPLETE
    hpca02r01n04:178:801 [2] NCCL INFO comm 0x7fb877137090 rank 2 nranks 16 cudaDev 2 nvmlDev 2 - Init COMPLETE
    hpca02r02n05:38:629 [0] NCCL INFO Ring 07 : 8 -> 0 [send] via NET/IB/3/GDRDMA
    hpca02r02n05:38:629 [0] NCCL INFO Trees [0] 10->8->11/-1/-1 [1] 12->8->9/-1/-1 [2] 9->8->12/-1/-1 [3] 0->8->11/-1/-1 [4] 10->8->11/-1/-1 [5] 12->8->9/-1/-1 [6] 9->8->12/-1/-1 [7] -1->8->11/0/-1
    hpca02r02n05:38:629 [0] NCCL INFO comm 0x7f6197879f50 rank 8 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
    hpca02r01n04:176:800 [0] NCCL INFO Trees [0] 2->0->3/-1/-1 [1] 4->0->1/-1/-1 [2] 1->0->4/-1/-1 [3] -1->0->3/8/-1 [4] 2->0->3/-1/-1 [5] 4->0->1/-1/-1 [6] 1->0->4/-1/-1 [7] 8->0->3/-1/-1
    hpca02r01n04:176:800 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled for all sizes
    hpca02r01n04:176:800 [0] NCCL INFO comm 0x7effb60eb430 rank 0 nranks 16 cudaDev 0 nvmlDev 0 - Init COMPLETE
    hpca02r01n04:176:800 [0] NCCL INFO Launch mode Parallel
    I0326 16:02:05.270775 139641160255296 basic_session_run_hooks.py:262] loss = 9.732766, step = 0
    I0326 16:02:05.633767 140434656532288 basic_session_run_hooks.py:262] loss = 9.731406, step = 0
    I0326 16:02:05.639509 140400681133888 basic_session_run_hooks.py:262] loss = 9.727156, step = 0
    I0326 16:02:05.661324 140559522916160 basic_session_run_hooks.py:262] loss = 9.736551, step = 0
    I0326 16:02:05.671058 139726011828032 basic_session_run_hooks.py:262] loss = 9.732926, step = 0
    I0326 16:02:05.674440 140535427962688 basic_session_run_hooks.py:262] loss = 9.729025, step = 0
    I0326 16:02:05.675842 140709400389440 basic_session_run_hooks.py:262] loss = 9.741848, step = 0
    I0326 16:02:05.676417 139622124476224 basic_session_run_hooks.py:262] loss = 9.73002, step = 0
    I0326 16:02:05.681042 139969783359296 basic_session_run_hooks.py:262] loss = 9.730884, step = 0
    I0326 16:02:05.686120 139980528740160 basic_session_run_hooks.py:262] loss = 9.743285, step = 0
    I0326 16:02:05.689950 140085856864064 basic_session_run_hooks.py:262] loss = 9.726588, step = 0
    I0326 16:02:05.690726 140271360452416 basic_session_run_hooks.py:262] loss = 9.741635, step = 0
    I0326 16:02:05.692858 140061501163328 basic_session_run_hooks.py:262] loss = 9.737735, step = 0
    I0326 16:02:05.693819 140047958247232 basic_session_run_hooks.py:262] loss = 9.734624, step = 0
    I0326 16:02:05.695163 139628563687232 basic_session_run_hooks.py:262] loss = 9.73243, step = 0
    I0326 16:02:05.699985 139695624750912 basic_session_run_hooks.py:262] loss = 9.742135, step = 0
    W0326 16:02:17.888350 140400681133888 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
    W0326 16:02:17.890248 139641160255296 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
    W0326 16:02:17.891450 140434656532288 basic_session_run_hooks.py:724] It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Optimizer.minimize.
    W0326 16:02:17.899081 140559522916160 basic
    ...
    nelsonmarcelino
    @nelsonmarcelino
    It is getting stuck and just repeating the message "It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 0 vs previous value: 0."