See `maximum_decoding_length` in the parameters: https://opennmt.net/OpenNMT-tf/configuration.html. Maybe `length_penalty` can also help.
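For reference, a minimal sketch of the relevant `params` block in the run configuration (the values are illustrative, not recommendations):

```yaml
params:
  # Hard cap on the number of generated target tokens (illustrative value).
  maximum_decoding_length: 250
  # Rescore beam search hypotheses by length; >0 favors longer outputs (illustrative value).
  length_penalty: 0.6
```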
`replace_unknown_target` uses the model attention to select the corresponding source token. However, it is well known that Transformer attention usually cannot be used as target-source alignments. You should either constrain the attention to be an alignment or use subword tokenization (like SentencePiece) to avoid UNK altogether. Note that the UNK token does not appear in the vocab but is automatically added when starting the training.
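If you want to keep `replace_unknown_target`, one sketch of the first option is to supervise the attention with guided alignment. This assumes you have pre-computed word alignments for your training data (e.g. from fast_align); the file path is a placeholder:

```yaml
data:
  # Pharaoh-format alignments ("0-0 1-2 ...") for the training corpus,
  # e.g. produced by fast_align. Path is illustrative.
  train_alignments: data/train.align

params:
  # Train one attention head toward the provided alignments (cross-entropy loss).
  guided_alignment_type: ce
  guided_alignment_weight: 1
  replace_unknown_target: true
```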
Check `max_step` in the training parameters. There should be a warning about this somewhere in the logs. We just improved that for the next version: a more visible error message will be shown, see OpenNMT/OpenNMT-tf@21df1c7
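If the step counter restored from a checkpoint already reached the limit, training stops immediately. A sketch of raising it (the value is illustrative):

```yaml
train:
  # Training stops once the global step reaches this value, so it must be
  # higher than the step of the checkpoint you are continuing from.
  max_step: 500000
```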
You can use the `case_markup` option from the Tokenizer.
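A minimal OpenNMT Tokenizer configuration sketch with the option enabled (the other options are illustrative):

```yaml
# Tokenization config referenced as source_tokenization/target_tokenization
# in the data section of the run configuration.
mode: aggressive
joiner_annotate: true
# Lowercase tokens and emit case markup tokens that are restored at detokenization.
case_markup: true
```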
I want to make sure I understand how `effective_batch_size` works. The auto config for a Transformer model is `batch_size: 3072`. This means that 9 iterations are required to accumulate the gradients to reach a batch size of 25000 on a single GPU. So does that mean that the actual effective batch size is `3072 * 9 = 27648`? If this is true, then I would expect that if I set `batch_size` to 8192, the actual effective batch size would be `8192 * 4 = 32768`. This feels like enough of a difference in effective batch size that it would have an impact on training. Is this accurate?
I would prefer not to change `batch_size`, since increasing it would result in OOM and decreasing it would result in underutilization of compute resources.
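For reference, this is how I understand the values in question in the `train` section, with the accumulation arithmetic restated in comments (accumulation steps = ceil(effective_batch_size / batch_size)):

```yaml
train:
  batch_type: tokens
  # ceil(25000 / 3072) = 9 accumulation steps -> true effective batch = 27648 tokens.
  # With batch_size: 8192, ceil(25000 / 8192) = 4 steps -> 32768 tokens.
  batch_size: 3072
  effective_batch_size: 25000
```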