chiting765
@chiting765
I want to get 'Aqui es mi palabra "alkhcxli"' when I turn on -replace_unk
Guillaume Klein
@guillaumekln
What do you get without -replace_unk?
chiting765
@chiting765
I get the same translation with or without -replace_unk
Guillaume Klein
@guillaumekln
So it's working as expected. This option only replaces unknown words that are generated. Your issue is a training data issue.
Typically you want the model to learn to translate <unk> to <unk>, and then -replace_unk can do its job at inference time.
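For illustration (not part of the original exchange), an inference call along those lines might look like the following; the model and file names are placeholders, and -phrase_table is the optional dictionary that -replace_unk uses to translate the attended source word instead of copying it verbatim:

th translate.lua -model model.t7 -src test.en -output pred.es -replace_unk -phrase_table phrase-table.txt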
chiting765
@chiting765
How do I train the model to learn to translate <unk> as <unk>? I think the tokenizer splits <unk> into several tokens
Guillaume Klein
@guillaumekln
More like: how to make source OOVs also OOV in the target. But you can also use BPE, which will mitigate your issue by a lot.
chiting765
@chiting765
So after training my model on my training data, I need another set of data which contains OOVs in both the source and target files to continue training the current model?
Is a BPE model good for English to Spanish? I think it is probably good for German or Turkish, but I'm not sure about Spanish
Jean Senellart
@jsenellart
@chiting765 - yes, BPE is excellent for English to Spanish; with this language pair, you should train a joint model.
Regarding the OOV issue - you just need to make sure that your vocabulary does not include all the words of your training data, which can happen when working with a relatively small dataset
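As an illustration of that (not from the original thread; file names are placeholders), capping the vocabulary sizes at preprocessing time is what forces the rarest training words to become <unk> on both sides:

th preprocess.lua -train_src train.en -train_tgt train.es -valid_src valid.en -valid_tgt valid.es -save_data data/demo -src_vocab_size 30000 -tgt_vocab_size 30000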
chiting765
@chiting765
@jsenellart Thanks!
Another question: when I train with the pyramidal deep bidirectional encoder (-encoder_type pdbrnn), I get the following error:
[12/08/17 12:02:38 INFO] Preparing memory optimization...
/home/languageintell/torch/install/bin/luajit: bad argument #2 to '?' (out of range)
stack traceback:
[C]: at 0x7f9d0cce44b0
[C]: in function 'index'
./onmt/data/Batch.lua:181: in function 'addInputFeatures'
./onmt/data/Batch.lua:197: in function 'getSourceInput'
./onmt/modules/Encoder.lua:233: in function 'forward'
./onmt/modules/BiEncoder.lua:153: in function 'forward'
./onmt/modules/PDBiEncoder.lua:137: in function 'forward'
./onmt/Seq2Seq.lua:209: in function 'trainNetwork'
./onmt/utils/Memory.lua:39: in function 'optimize'
./onmt/train/Trainer.lua:137: in function '__init'
...anguageintell/torch/install/share/lua/5.1/torch/init.lua:91: in function 'new'
train.lua:282: in function 'main'
train.lua:288: in main chunk
[C]: in function 'dofile'
...tell/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
Did I miss anything or is it a bug?
Konstantin Glushak
@gsoul

Now I used only the Lua version of tokenizer/learn_bpe + the Python "create_vocab", and I got this: https://drive.google.com/file/d/1Iwu0CvU48WzYfl1M6_LkKbtc8RvN-ymQ/view?usp=sharing

14323 tokens instead of 32k.

@guillaumekln
Is this with the latest version of the Lua scripts? Also, could you try without changing the joiner marker? It should not be an issue but it wasn't properly tested I guess.

Sorry for the late reply. I used the latest master version of the Lua code and I just tried without the joiner marker; the result is exactly the same.

cat [pathto]/enfr/giga-fren.release2.token.en [pathto]/enfr/giga-fren.release2.token.fr | th tools/learn_bpe.lua -size 32000 -save_bpe giga_codes.txt

th tools/tokenize.lua -bpe_model giga_codes.txt -nparallel 6 -joiner_annotate -mode aggressive -segment_numbers < [pathto]/enfr/giga-fren.release2.token.shuf.en > [pathto]/enfr/giga-fren.release2.bpe.shuf.en

cat ./enfr/giga-fren.release2.token.shuf.en ./enfr/giga-fren.release2.token.shuf.fr > data/enfr/tmp.txt

python -m bin.build_vocab --save_vocab data/enfr/src-bpe.txt data/enfr/tmp.txt --size 32000
@guillaumekln please take a look, when you have a minute.
Konstantin Glushak
@gsoul

I’m thinking that maybe “-size 32000” for learn_bpe.lua is too much?

Because the comments say that the -size option is not a vocabulary size, but rather:

[[The number of merge operations to learn.]]

Konstantin Glushak
@gsoul
@guillaumekln when I increased the "-size" parameter of learn_bpe to 64000, I could finally collect my vocabulary of 32k tokens, while there were ~48k unique tokens in the corpora.
lpzreq
@lpzreq
Has anybody tested NVIDIA Volta with OpenNMT? How much faster is it than a GTX 1080 Ti?
Guillaume Klein
@guillaumekln
@gsoul Can you try without -nparallel? I already trained models on OpenNMT-tf with joiners and there were no issues. On a side note, you should pass -tok_mode aggressive -tok_segment_numbers to learn_bpe.lua for consistency.
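For example, the earlier learn_bpe.lua call with those options added would become something like this (a sketch reusing the paths from the command above):

cat [pathto]/enfr/giga-fren.release2.token.en [pathto]/enfr/giga-fren.release2.token.fr | th tools/learn_bpe.lua -size 32000 -tok_mode aggressive -tok_segment_numbers -save_bpe giga_codes.txt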
Konstantin Glushak
@gsoul
@guillaumekln I will try, thank you. And if there are any recommendations, could you please advise on a proper "-size" parameter for learn_bpe.lua?
Guillaume Klein
@guillaumekln
30000 is frequently used.
Konstantin Glushak
@gsoul
The issue is that when I use it for the giga-fren corpora, I can't get a 32k vocabulary, only about 14k. Though maybe that's because I didn't apply -tok_mode to learn_bpe, but did it for all the other steps. I'll try it once more, thanks again!
Ben Peters
@bpopeters
Is there an easy way to extract feature embeddings? extract_embeddings.lua is only giving me the word embeddings.
Ratish Puduppully
@ratishsp
Hi, I am trying to understand the decoding process in decoder.lua. I want to know why we iterate in the reverse direction in the backward method: for t = batch.targetLength, 1, -1 do
Guillaume Klein
@guillaumekln
@bpopeters, in extract_embeddings.lua, instead of catching torch.type(m) == "onmt.WordEmbedding", you could catch all torch.type(m) == "nn.LookupTable", dump m.weight, and recognize the feature based on the vocabulary size.
@ratishsp, this is the idea of the backward pass: to walk the graph in the reverse order.
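A rough sketch of that idea (illustrative only, not the actual extract_embeddings.lua patch): assuming the encoder is already loaded as an nn module the way extract_embeddings.lua loads it, and that the word vocabulary size is known, walk the module tree, dump every nn.LookupTable, and label each dump by its first dimension:

require('nn')

-- encoder: an nn module loaded from the checkpoint (as in extract_embeddings.lua)
-- wordVocabSize: size of the word vocabulary, used to tell word and feature embeddings apart
local function dumpEmbeddings(encoder, wordVocabSize)
  local count = 0
  encoder:apply(function(m)
    if torch.type(m) == 'nn.LookupTable' then
      count = count + 1
      -- the weight matrix is vocabularySize x embeddingSize
      local kind = (m.weight:size(1) == wordVocabSize) and 'word' or 'feature'
      torch.save(string.format('%s_embeddings_%d.t7', kind, count), m.weight:float())
    end
  end)
end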
Vincent Nguyen
@vince62s
Did you guys read this https://arxiv.org/pdf/1712.05690.pdf
Jean Senellart
@jsenellart
Yes - met with them too. Same algorithms, different scores. The benchmarks don't look good presented like that... and we need to do something about it.
Data Scientist
@JayKimBravekjh
Hi everyone. I joined this room for the first time today, nice to meet you all
Jean Senellart
@jsenellart
Hi @bravekjh - welcome
Vincent Nguyen
@vince62s
@jsenellart @guillaumekln I don't know which commit messed up the server, but read this: http://forum.opennmt.net/t/error-in-rest-translation-server-lua-105-attempt-to-index-field-preds-a-nil-value-500/1114
Jean Senellart
@jsenellart
looks like the last message says it is fixed
Vincent Nguyen
@vince62s
oh, never mind, I got confused; she said she was on master...
Data Scientist
@JayKimBravekjh
thanks @jsenellart
Vincent Nguyen
@vince62s
do we need Lua instead of LuaJIT for lua-sentencepiece? (see OpenNMT/lua-sentencepiece#3)
Vincent Nguyen
@vince62s
what is the command line to detokenize with lua-sentencepiece?
Guillaume Klein
@guillaumekln
You are using the sentencepiece hook, right? I think you just need to call tools/detokenize.lua with it.
Vincent Nguyen
@vince62s
but with hooks/sentencepiece on the command line?
Guillaume Klein
@guillaumekln
Yes.
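If I remember the hooks mechanism correctly, that would look roughly like this (the option name and file names are assumptions, check your version):

th tools/detokenize.lua -hook_file hooks/sentencepiece < pred.sp.txt > pred.txt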
Vincent Nguyen
@vince62s
ok
Guillaume Klein
@guillaumekln
If you are preparing your data offline, you could also directly use the sentencepiece project and not the Lua wrapper.
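For the offline route, the stock SentencePiece command-line tools would be used directly, roughly like this (file names, model prefix, and vocabulary size are placeholders):

spm_train --input=train.raw.txt --model_prefix=spm_enfr --vocab_size=32000
spm_encode --model=spm_enfr.model < train.raw.en > train.sp.en
spm_decode --model=spm_enfr.model < pred.sp.en > pred.detok.en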
lpzreq
@lpzreq
Hi. Will the Google encoder be added to CTranslate? If not, where can I read information about the Google encoder?
Guillaume Klein
@guillaumekln
Hello, there is no plan to add it. You should at least change the forward logic and maybe the model loading based on the GoogleEncoder class.
lpzreq
@lpzreq
oh. thanks (
lpzreq
@lpzreq
why don't you plan to add it? Is GNMT not good? :)
Vincent Nguyen
@vince62s
Has anyone tried to use CUDA 9 with Torch / Lua OpenNMT?
Guillaume Klein
@guillaumekln
@lpzreq It's not a priority to support custom encoders in CTranslate. But a PR is always welcome.
lpzreq
@lpzreq
Ok. Thanks.