In general I follow this default mode:
tools/learn_bpe.lua -size 30000 -save_bpe codes < input_tokenized
tools/tokenize.lua -bpe_model codes < input_tokenized
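In case it helps, here is that same default pipeline written out end to end; the file names (train.tok, train.tok.bpe) are placeholders I'm making up, not something from above:
# learn 30k merge operations from the already-tokenized corpus
tools/learn_bpe.lua -size 30000 -save_bpe codes < train.tok
# apply the learned codes to the same corpus and keep the result on disk
tools/tokenize.lua -bpe_model codes < train.tok > train.tok.bpe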
@guillaumekln my process for the BPE was:
The issue I see is that the vocabulary size I get is around 11k tokens, and the last ones are some random Chinese characters. So I guess I have a mistake somewhere in the process. Maybe it’s the number of merges for learn_bpe.lua (30000). I don’t know. Could you please advise if you have any insight here?
tools/learn_bpe.lua -size 32000 -save_bpe codes -tok_mode aggressive -tok_segment_numbers < input_raw
tools/tokenize.lua -bpe_model codes -mode aggressive -segment_numbers -joiner_annotate < input_raw
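A quick way to sanity-check the effective subword vocabulary after that step is to count the distinct tokens in the BPE-tokenized output. This is just a rough check with a made-up output file name (output_bpe), not an official tool:
tools/tokenize.lua -bpe_model codes -mode aggressive -segment_numbers -joiner_annotate < input_raw > output_bpe
# count distinct whitespace-separated tokens (joiner-annotated variants count separately)
tr ' ' '\n' < output_bpe | sort -u | wc -l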
As for the single quote, a backslash should also work. But if it doesn’t, please paste your JSON on some site like https://pastebin.com/ and give us a link to it. Thank you.
As for unk, what model did you train? What data did you use for it? And for how long did you train it, using what hardware?
@guillaumekln now I used only the Lua version of the tokenizer/learn_bpe + the Python “create_vocab” and I got this: https://drive.google.com/file/d/1Iwu0CvU48WzYfl1M6_LkKbtc8RvN-ymQ/view?usp=sharing
14323 tokens instead of 32k.
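One thing worth checking: as far as I know, the requested -size is only an upper bound, since learn_bpe can stop early when the corpus runs out of pairs to merge, and the vocabulary observed after applying the codes is usually smaller still (intermediate merges get absorbed into longer ones). A rough check, assuming the codes were saved to a file named codes (there may be a header line depending on the version):
# number of merge operations actually written out
wc -l < codes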
Out-of-vocabulary words are replaced by <unk>, and then -replace_unk can do its job at inference time.
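For reference, a minimal sketch of how that looks at inference time with the Lua translator; the model and file names are placeholders, and it’s worth double-checking the exact options against your OpenNMT version:
# translate BPE-tokenized input, copying source words in place of <unk>
th translate.lua -model model_final.t7 -src test.tok.bpe -output pred.bpe -replace_unk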