Konstantin Glushak
@gsoul

@guillaumekln my process for the BPE was:

  1. Learn BPE codes on a joined corpus of the En and Fr files
  2. Apply those codes to both files through tokenize.lua
  3. Build 2 vocabularies for the En and Fr files with the Python script from OpenNMT-tf. I set the vocabulary size to 32000

The issue I see is that the vocabulary size I get is around 11k tokens, and the last ones are some random Chinese characters. So I guess I have a mistake somewhere in the process. Maybe it’s the number of merges for learn_bpe.lua (30000). I don’t know. Could you please advise if you have any insight here?
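(A minimal shell sketch of those three steps, assuming the OpenNMT Lua tools and OpenNMT-tf's build_vocab module as used elsewhere in this discussion; file names are placeholders.)

# 1. Learn BPE codes on the joined En+Fr corpus (-size is the number of merge operations).
cat train.en train.fr | th tools/learn_bpe.lua -size 32000 -save_bpe bpe_codes.txt

# 2. Apply the codes to each file.
th tools/tokenize.lua -bpe_model bpe_codes.txt -joiner_annotate < train.en > train.bpe.en
th tools/tokenize.lua -bpe_model bpe_codes.txt -joiner_annotate < train.fr > train.bpe.fr

# 3. Build the two vocabularies with the OpenNMT-tf script.
python -m bin.build_vocab --save_vocab vocab.en train.bpe.en --size 32000
python -m bin.build_vocab --save_vocab vocab.fr train.bpe.fr --size 32000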

Guillaume Klein
@guillaumekln
What tokenization did you apply when using learn_bpe.lua?
Konstantin Glushak
@gsoul
I fed in the tokenised file, and I used Spacy to produce it. A manual check showed that the tokenization looks good and shouldn’t be an issue, imho.
Previously I trained a t2t model with the same tokenisation, just without the BPE part.
Guillaume Klein
@guillaumekln
What is the reasoning behind using the Spacy tokenizer? I would simply do something like this:
tools/learn_bpe.lua -size 32000 -save_bpe codes -tok_mode aggressive -tok_segment_numbers < input_raw
tools/tokenize.lua -bpe_model codes -mode aggressive -segment_numbers -joiner_annotate < input_raw
Konstantin Glushak
@gsoul

I’ll try it, thanks.

Another question I have is whether it’s correct to learn BPE on a joined (src + tgt) file, but then, for NMT training, create separate dictionaries for src and tgt?

Guillaume Klein
@guillaumekln
You can generate a single vocabulary on the joined tokenized corpus.
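(An illustrative sketch of building that single shared vocabulary, reusing the build_vocab invocation shown later in this log; file names are placeholders.)

cat train.bpe.en train.bpe.fr > train.bpe.joint
python -m bin.build_vocab --save_vocab joint-vocab.txt --size 32000 train.bpe.joint
# The same joint-vocab.txt is then used as both the source and target vocabulary.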
chiting765
@chiting765
Hi~~ I have a question about the REST server: when I test it with the curl command, I include the source text in JSON. If my source text contains quotation marks, single or double, how should I escape them? Thanks!
Konstantin Glushak
@gsoul
Hey, did you try a backslash? \"
chiting765
@chiting765
\" seems to be OK, but bash does not like \'
How should I escape an apostrophe?
chiting765
@chiting765
Also, -replace_unk just doesn't work for me and I don't know why
Konstantin Glushak
@gsoul

As for the single quote, a backslash should also work. But if it doesn’t, please paste your JSON on some site like https://pastebin.com/ and give us a link to it. Thank you.

As for unk, what model did you train? What data did you use for it? And for how long did you train it, using what hardware?

Konstantin Glushak
@gsoul

@guillaumekln now I used only the Lua version of tokenizer/learn_bpe + the Python “create_vocab” and I got this: https://drive.google.com/file/d/1Iwu0CvU48WzYfl1M6_LkKbtc8RvN-ymQ/view?usp=sharing

14323 tokens instead of 32k.

Guillaume Klein
@guillaumekln
Is this with the latest version of the Lua scripts? Also, could you try without changing the joiner marker? It should not be an issue but it wasn't properly tested I guess.
chiting765
@chiting765
@gsoul, my test JSON is really simple: curl -v -H "Content-Type: application/json" -X POST -d '[{ "src" : "Mary\'s company" }]' http://IP_address:7784/translator/translate. I cannot execute the command because it is trying to find a match for the third ' I think.
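(An aside on the quoting issue, not part of the original exchange: inside a single-quoted bash string a backslash is literal, so \' does not escape the quote, it ends the string. Two standard workarounds for the payload above:)

# Close the single-quoted string, insert an escaped quote, and reopen it:
curl -v -H "Content-Type: application/json" -X POST \
  -d '[{ "src" : "Mary'\''s company" }]' http://IP_address:7784/translator/translate

# Or double-quote the payload and escape the inner double quotes instead:
curl -v -H "Content-Type: application/json" -X POST \
  -d "[{ \"src\" : \"Mary's company\" }]" http://IP_address:7784/translator/translate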
As for -replace_unk, does it matter what model, data, and machine I used?
Also, I am trying to use GRU instead of LSTM as the -rnn_type; however, both the training perplexity and the validation perplexity are really large after two epochs. I have more than 228000 sentences, and usually with LSTM the ppl is pretty small after some iterations.
chiting765
@chiting765
To use GRU, is there anything I need to do besides changing -rnn_type in train.lua? Do I need to do anything special during preprocessing?
chiting765
@chiting765
Can GRU handle additional features like case and domain?
lpzreq
@lpzreq
Why can tokenization with case_feature influence the translation result?
For example: House -> House House, sonne -> House
lpzreq
@lpzreq
For example: House -> House, house -> House
Guillaume Klein
@guillaumekln
@chiting765 This has been covered many times on the forum and GitHub issues. You should use a smaller learning rate with GRUs. They support all features.
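(A hedged sketch of what that can look like with the Lua train.lua; the data path, model name, and learning rate value are placeholders to adapt to your setup.)

th train.lua -data demo-train.t7 -save_model demo-model -rnn_type GRU -learning_rate 0.1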
chiting765
@chiting765
@guillaumekln Thank you! I will try a smaller learning rate
chiting765
@chiting765
As for the -replace_unk problem: I didn't train any bpe_model; is that required to use the -replace_unk feature?
Guillaume Klein
@guillaumekln
No
chiting765
@chiting765
So what could be the reason that -replace_unk does not work for me?
Guillaume Klein
@guillaumekln
What do you mean by does not work?
chiting765
@chiting765
I trained a model and translated a sentence with some fake words I created, which are definitely OOV; the translations of the fake words are some random target words instead of the fake words themselves.
Guillaume Klein
@guillaumekln
That's not how -replace_unk works, see the option description. It only replaces generated target unknown words.
chiting765
@chiting765
It should replace the target unknown token with the source token that has the highest attention weight, right? But I just get some random target tokens as the translation for my unknown source tokens.
For example, if I translate 'Here is my "alkhcxli" word' to Spanish, I get 'Aqui es mi palabra "rnnr5555"'
chiting765
@chiting765
I want to get 'Aqui es mi palabra "alkhcxli"' when I turn on -replace_unk
Guillaume Klein
@guillaumekln
What do you get without -replace_unk?
chiting765
@chiting765
I get the same translation with or without -replace_unk
Guillaume Klein
@guillaumekln
So it's working as expected. This option only replaces unknown words that are generated. Your issue is a training data issue.
Typically you want the model to learn to translate <unk> by <unk> and then -replace_unk can do its job at inference time.
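(A minimal inference-time sketch of that; model and file names are placeholders.)

th translate.lua -model model_final.t7 -src test.en -output pred.es -replace_unk
# Optionally, -phrase_table <file> maps the copied source word to a dictionary translation when one is available.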
chiting765
@chiting765
How do I train the model to learn to translate <unk> by <unk>? I think the tokenizer separates <unk> into different tokens.
Guillaume Klein
@guillaumekln
More like how to make source OOVs also OOV in the target. But you can also use BPE, and it will mitigate your issue by a lot.
chiting765
@chiting765
So after training my model using my training data, I need another set of data which contains OOVs in both the source and target files to continue training the current model?
Is a BPE model good for English to Spanish? I think it is probably good for German or Turkish, not sure about Spanish.
Jean Senellart
@jsenellart
@chiting765 - yes, BPE is excellent for English to Spanish; with this language pair, you should train a joint model.
Regarding the OOV issue: you just need to make sure that your vocabulary does not include all of the vocabulary of your training data, which can happen when working with a relatively small dataset.
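(An illustrative sketch: with the Lua preprocess.lua, the vocabulary can be capped so that rare training words become <unk> and the model actually sees unknowns during training; paths and sizes are placeholders.)

th preprocess.lua -train_src train.en -train_tgt train.es \
  -valid_src valid.en -valid_tgt valid.es \
  -src_vocab_size 30000 -tgt_vocab_size 30000 -save_data demo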
chiting765
@chiting765
@jsenellart Thanks!
Another question: when I train with the pyramidal deep bidirectional encoder (-encoder_type pdbrnn), I get the following error:
[12/08/17 12:02:38 INFO] Preparing memory optimization...
/home/languageintell/torch/install/bin/luajit: bad argument #2 to '?' (out of range)
stack traceback:
[C]: at 0x7f9d0cce44b0
[C]: in function 'index'
./onmt/data/Batch.lua:181: in function 'addInputFeatures'
./onmt/data/Batch.lua:197: in function 'getSourceInput'
./onmt/modules/Encoder.lua:233: in function 'forward'
./onmt/modules/BiEncoder.lua:153: in function 'forward'
./onmt/modules/PDBiEncoder.lua:137: in function 'forward'
./onmt/Seq2Seq.lua:209: in function 'trainNetwork'
./onmt/utils/Memory.lua:39: in function 'optimize'
./onmt/train/Trainer.lua:137: in function '__init'
...anguageintell/torch/install/share/lua/5.1/torch/init.lua:91: in function 'new'
train.lua:282: in function 'main'
train.lua:288: in main chunk
[C]: in function 'dofile'
...tell/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
Did I miss anything or is it a bug?
Konstantin Glushak
@gsoul

(quoting my earlier message) now I used only the Lua version of tokenizer/learn_bpe + the Python “create_vocab” and I got this: https://drive.google.com/file/d/1Iwu0CvU48WzYfl1M6_LkKbtc8RvN-ymQ/view?usp=sharing 14323 tokens instead of 32k.

(quoting @guillaumekln) Is this with the latest version of the Lua scripts? Also, could you try without changing the joiner marker? It should not be an issue but it wasn't properly tested I guess.

Sorry for the late reply. I used the latest master version of the Lua code, and I just tried without the joiner marker; the result is exactly the same.

cat [pathto]/enfr/giga-fren.release2.token.en [pathto]/enfr/giga-fren.release2.token.fr | th tools/learn_bpe.lua -size 32000 -save_bpe giga_codes.txt

th tools/tokenize.lua -bpe_model giga_codes.txt -nparallel 6 -joiner_annotate -mode aggressive -segment_numbers < [pathto]/enfr/giga-fren.release2.token.shuf.en > [pathto]/enfr/giga-fren.release2.bpe.shuf.en

cat ./enfr/giga-fren.release2.token.shuf.en ./enfr/giga-fren.release2.token.shuf.fr > data/enfr/tmp.txt

python -m bin.build_vocab --save_vocab data/enfr/src-bpe.txt data/enfr/tmp.txt --size 32000
@guillaumekln please take a look when you have a minute.
Konstantin Glushak
@gsoul

I’m thinking that maybe “-size 32000” for learn_bpe.lua is too much?

Because the comments say that the -size option is not a vocabulary size, but rather:

[[The number of merge operations to learn.]]
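(To make the distinction concrete: -size sets the number of merge operations, not the final vocabulary, so the resulting token vocabulary can be much smaller. It can be counted directly on the BPE-tokenized output, for example with a shell one-liner like the sketch below, assuming both sides have been tokenized as above.)

# Count distinct tokens in the BPE-tokenized corpus.
cat giga-fren.release2.bpe.shuf.en giga-fren.release2.bpe.shuf.fr \
  | tr ' ' '\n' | grep -v '^$' | sort -u | wc -l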