`replace_unknown_target` uses the model attention to select the corresponding source token. However, it is well known that Transformer attention usually cannot be used as target-source alignments. You should either constrain the attention to be an alignment or use subword tokenization (like SentencePiece) to avoid UNK. Note that the UNK token does not appear in the vocab but is automatically added when starting the training.
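For reference, here is a minimal sketch of applying SentencePiece subword tokenization with pyonmttok; the model path `sp.model` is a placeholder for a SentencePiece model you have already trained, and the exact tokens may differ depending on that model:

```python
import pyonmttok

# Tokenize with an existing SentencePiece model so rare words are split into
# subwords instead of being mapped to UNK ("sp.model" is a placeholder path).
tokenizer = pyonmttok.Tokenizer("none", sp_model_path="sp.model")

tokens, _ = tokenizer.tokenize("An infrequent word stays representable as subwords.")
print(tokens)

# Detokenization reverses the segmentation after translation.
print(tokenizer.detokenize(tokens))
```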
The training probably reached `max_step` in the training parameters. There should be a warning about this somewhere in the logs. We just improved that for the next version: a more visible error message will be shown, see OpenNMT/OpenNMT-tf@21df1c7
You can use the `case_markup` option from the Tokenizer.
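As a quick illustration with pyonmttok (the exact markup tokens in the output may vary across versions):

```python
import pyonmttok

# case_markup lowercases the tokens and injects special case-markup tokens,
# so casing is learned as explicit markup instead of inflating the vocabulary.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, case_markup=True)

tokens, _ = tokenizer.tokenize("Hello WORLD")
print(tokens)  # lowercased tokens plus case-modifier markup tokens

# Detokenization restores the original casing from the markup.
print(tokenizer.detokenize(tokens))
```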
I want to make sure I understand how `effective_batch_size` works. The auto config for a Transformer model is `effective_batch_size: 25000` and `batch_size: 3072`. This means that 9 iterations are required to accumulate the gradients to reach a batch size of 25000 on a single GPU. So does that mean that the actual effective batch size is 3072 * 9 = 27648? If this is true, then I would expect that if I set `batch_size` to 8192, the actual effective batch size would be 8192 * 4 = 32768. This feels like enough of a difference in effective batch size that it would have an impact on training. Is this accurate?
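To make my assumption concrete, here is the arithmetic I have in mind. I am assuming the accumulation count is simply the ceiling of `effective_batch_size / batch_size`; this is based on the numbers above, not on the OpenNMT-tf source:

```python
import math

def realized_effective_batch_size(effective_batch_size, batch_size):
    # Assumed behavior: gradients are accumulated over ceil(effective / batch) steps,
    # so the realized effective batch size rounds up to a multiple of batch_size.
    steps = math.ceil(effective_batch_size / batch_size)
    return steps, steps * batch_size

print(realized_effective_batch_size(25000, 3072))  # (9, 27648)
print(realized_effective_batch_size(25000, 8192))  # (4, 32768)
```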
I would rather not change `batch_size`, since increasing it would result in OOM and decreasing it would result in underutilization of compute resources.