As for the single quote, escaping it with a backslash should also work. If it doesn't, please paste your JSON on a site like https://pastebin.com/ and send us the link. Thank you.
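Whether a backslash helps depends on which layer reads it (the shell or the JSON parser). As a quick local check before pasting the payload anywhere, a small sketch like the one below can tell whether the text parses as strict JSON. The `check_json` helper and the `src` field are only illustrative, not taken from the discussion.

```python
import json

def check_json(text):
    """Return (True, parsed) if text is valid JSON, else (False, error message)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as err:
        return False, f"{err.msg} at line {err.lineno}, column {err.colno}"

# Hypothetical payload: a single quote inside a double-quoted JSON string
# needs no escaping at all.
ok_plain, parsed = check_json('[{"src": "it\'s a test"}]')

# Same payload with a backslash before the quote. Backslash-quote is not one
# of JSON's escape sequences, so Python's strict json parser rejects it.
ok_escaped, reason = check_json('[{"src": "it\\\'s a test"}]')

print(ok_plain, ok_escaped, reason)
```

If the second form is what ends up being sent, the backslash is probably being added for the shell's benefit and should not survive into the JSON body itself.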
As for <unk>, which model did you train, and on what data? How long did you train it, and on what hardware?
@guillaumekln This time I used only the Lua version of tokenizer/learn_bpe plus the Python “create_vocab”, and I got this: https://drive.google.com/file/d/1Iwu0CvU48WzYfl1M6_LkKbtc8RvN-ymQ/view?usp=sharing
14323 tokens instead of 32k.
<unk> by <unk> and then -replace_unk can do its job at inference time.
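For context, OpenNMT describes -replace_unk as replacing each generated unknown token with the source token that received the highest attention weight. The following is only a toy sketch of that idea in plain Python, not the actual OpenNMT code; the function name, the sentences, and the attention matrix are all made up for illustration.

```python
def replace_unk(target_tokens, source_tokens, attention, unk="<unk>"):
    """Replace each unk in the target with the most-attended source token.

    attention[i][j] is the weight given to source token j when target
    token i was generated (toy values below, not real model output).
    """
    output = []
    for i, token in enumerate(target_tokens):
        if token == unk:
            weights = attention[i]
            # Index of the source position with the largest attention weight.
            best = max(range(len(source_tokens)), key=lambda j: weights[j])
            token = source_tokens[best]
        output.append(token)
    return output

source = ["I", "visited", "Xanadu"]
target = ["J'ai", "visité", "<unk>"]
attention = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
]

result = replace_unk(target, source, attention)
print(result)  # ['J'ai', 'visité', 'Xanadu']
```

This only works if the unknown token on the target side is aligned with something worth copying on the source side, which is why it is an inference-time fallback rather than a fix for a too-small vocabulary.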
Is this with the latest version of the Lua scripts? Also, could you try without changing the joiner marker? It should not be an issue but it wasn't properly tested I guess.
Sorry for the late reply. I used the latest master version of the Lua code, and I just tried without the joiner marker; the result is exactly the same.
cat [pathto]/enfr/giga-fren.release2.token.en [pathto]/enfr/giga-fren.release2.token.fr | th tools/learn_bpe.lua -size 32000 -save_bpe giga_codes.txt
th tools/tokenize.lua -bpe_model giga_codes.txt -nparallel 6 -joiner_annotate -mode aggressive -segment_numbers < [pathto]/enfr/giga-fren.release2.token.shuf.en > [pathto]/enfr/giga-fren.release2.bpe.shuf.en
cat ./enfr/giga-fren.release2.token.shuf.en ./enfr/giga-fren.release2.token.shuf.fr > data/enfr/tmp.txt
python -m bin.build_vocab --save_vocab data/enfr/src-bpe.txt data/enfr/tmp.txt --size 32000
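To double-check a figure like 14323 independently of the vocabulary script, one option is to count the distinct whitespace-separated tokens in the file that is actually passed to it. A minimal sketch, assuming one sentence per line; `count_tokens` and `summarize` are hypothetical helpers and do not reproduce what bin.build_vocab does internally.

```python
from collections import Counter
import io

def count_tokens(lines):
    """Count whitespace-separated tokens over an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def summarize(counts, size):
    """Return (distinct token types, types kept under a size cap)."""
    kept = counts.most_common(size)
    return len(counts), len(kept)

# Toy stand-in for a BPE-segmented file; "￭" marks a joiner annotation.
sample = io.StringIO("the low ￭er low ￭est\nthe new ￭er\n")
counts = count_tokens(sample)
distinct, kept = summarize(counts, size=32000)
print(distinct, kept)  # 5 5

# With a real file (path is illustrative):
# with open("data/enfr/tmp.txt", encoding="utf-8") as f:
#     counts = count_tokens(f)
```

A size option can only cap the vocabulary, so if the input contains fewer distinct tokens than the requested 32000, the resulting vocabulary will be smaller than the cap.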