    Guillaume Klein
    @guillaumekln
    Do you mind opening an issue on the GitHub repository?
    Daniel Marín
    @dmar1n
    Sure, I will try to add further info there. Thanks
    Guillaume Klein
    @guillaumekln
    Thanks
    Daniel Marín
    @dmar1n
    Hi @guillaumekln, I created the issue on GitHub, but I might close it and open a new one. After further tests, it seems the performance drop is related to the combination of TF 2.3 and OpenNMT 2.17 (at least, just downgrading to OpenNMT 2.15 seems to fix the problem). On a different machine, I had tested TF 2.4 with OpenNMT 2.17, and there was even a performance increase of around 15% (as other users had pointed out), but it seems that with TF 2.3 there is an issue.
    Guillaume Klein
    @guillaumekln
    Ok. I'm testing with TensorFlow 2.3 and OpenNMT-tf 2.17 but I can't reproduce the issue. You might need to include more information such as the training configuration, model definition, and full training logs.
    Daniel Marín
    @dmar1n
    I have updated the GitHub issue. I confirm the problem is not related to the shared embeddings, as I originally thought, but to the upgrade to OpenNMT 2.17 under TensorFlow 2.3.1. With the exact same configuration and OpenNMT 2.16, the performance is restored. Let me know if you need more info.
    alrudak
    @alrudak
    Hello, we are trying to run CTranslate2 on an A100 GPU and get this error:
    result = self._model.translate_batch(batch_tokens)
    RuntimeError: cuBLAS failed with status CUBLAS_STATUS_NOT_SUPPORTED
    Does anyone know how to fix it?
    NVIDIA-SMI 460.32.03, Driver Version: 460.32.03, CUDA Version: 11.2
    Guillaume Klein
    @guillaumekln
    Hi. How did you convert the model and build the Translator object? Can you post these details in this issue: OpenNMT/CTranslate2#414
    alrudak
    @alrudak

    Before the CUBLAS_STATUS_NOT_SUPPORTED error we got an "out of memory" error.

    We run 2 models, each around 300 MB. But in nvidia-smi I saw that only 1 GB of the 40 GB is used, and then we get "out of memory".

    To convert the models to CTranslate2 (creating 8-bit models), I used this command:

    ct2-opennmt-tf-converter --model_path INPUT_ONMT_MODEL_DIR --model_spec TransformerBig --output_dir OUTPUT_DIR --quantization int8

    Maybe it's because of the "fabric manager" that needs to run with the A100 GPU?

    https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-460-32-03/index.html

    The Translator object is created like this:

    model = ctranslate2.Translator(path, device=DEVICE)

    The package version:

    opennmt/ctranslate2:latest-ubuntu18-cuda11.0

    I posted these details to the #414 thread on GitHub, but it is marked as closed. I just want to make sure that someone will look into it more deeply.
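
    For reference, a minimal sketch of the full setup described above (MODEL_DIR, DEVICE, and the tokenized batch are placeholders; the model directory is assumed to be the output of the converter command shown earlier):

    import ctranslate2

    MODEL_DIR = "OUTPUT_DIR"   # hypothetical path: output of ct2-opennmt-tf-converter
    DEVICE = "cuda"            # the A100 case discussed here

    # Load the int8-converted model on the GPU.
    translator = ctranslate2.Translator(MODEL_DIR, device=DEVICE)

    # translate_batch expects pre-tokenized sentences (lists of token strings).
    batch_tokens = [["▁Hello", "▁world"]]   # hypothetical tokens
    results = translator.translate_batch(batch_tokens)
    print(results[0])   # the result format depends on the CTranslate2 version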
    Guillaume Klein
    @guillaumekln
    Thanks for the info. I reopened the issue. We'll continue the discussion there.
    Gerardo Cervantes
    @gcervantes8
    Is it normal for a Transformer model to take much longer to train? I'm getting 0.06 steps per second for Transformer vs. 3.15 steps per second for NMTBigV1. I am also running with TensorFlow 2.1, so that could be why I'm seeing this.
    Guillaume Klein
    @guillaumekln
    What are the reported source and target tokens per second? These are better to compare performance.
    Gerardo Cervantes
    @gcervantes8
    For NMTBigV1 I get around 2800; for Transformer I get around 1205. I'm getting very similar numbers for source and target words per second.
    Gerardo Cervantes
    @gcervantes8
    I'm noticing that when I use mixed precision with Transformers, the source and target words per second jump from 1205 to around 3150, but the steps per second are slightly lower at 0.05. I'm a little confused about the difference between words per second and steps per second; I'm still trying to understand it.
    Guillaume Klein
    @guillaumekln
    I think you set a very small batch size for the Transformer and one step accumulates many batches, hence the low number of steps per second.
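    For illustration with made-up numbers: if the effective batch size were 25,000 tokens and each batch were capped at 64 tokens, one step would accumulate roughly 390 batches, so the reported steps per second would be about 390x lower even though words per second stays roughly the same.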
    Gerardo Cervantes
    @gcervantes8
    Interesting. I will try a bigger batch size; the one I'm using now is 64. Thank you.
    Gerardo Cervantes
    @gcervantes8
    Running with a batch size of 512 and mixed precision gave me around 6300 words per second; it also increased the steps per second to 0.22, which is much faster! Is this closer to the speed I should expect when training a Transformer model?
    Guillaume Klein
    @guillaumekln
    Are you using auto_config? If so, the Transformer batch size is defined in number of tokens, not examples; the default value is 3072 tokens. You can also set batch_size=0 and the training will select a batch size for you.
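    As a sketch, the relevant part of the YAML training configuration would look something like this (the values are just the ones mentioned above, not a recommendation):

    train:
      batch_type: tokens   # with auto_config, the Transformer batch size is counted in tokens
      batch_size: 3072     # the default mentioned above; set to 0 to let training pick a size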
    Gerardo Cervantes
    @gcervantes8
    I am using auto_config. That may be why! I'll try with batch_size 0 and report the speed.
    Gerardo Cervantes
    @gcervantes8
    I tried a batch size of 0 but got out-of-memory errors. I reduced the size of the vocabularies and tried a batch size of 64 with batch type "examples", and got around 8100 source and target words per second with 0.022 steps per second.
    Guillaume Klein
    @guillaumekln
    I suggest sticking with batch type "tokens" for Transformers; otherwise you also need to override the effective_batch_size parameter, which controls how many batches are accumulated.
    If you are getting started with Transformer models, you could just remove all the parameters that you defined (except the data) and use auto_config. See for example the quickstart: https://opennmt.net/OpenNMT-tf/quickstart.html
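    For example, a quickstart-style run would look roughly like this (the config file name is a placeholder):

    onmt-main --model_type Transformer --config data.yml --auto_config train --with_eval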
    Gerardo Cervantes
    @gcervantes8
    Thanks! Using auto_config boosted the steps per second to about 0.35.
    alrudak
    @alrudak

    We ran tests with the latest CTranslate2 (2.0) release and found that translation speed on a GeForce RTX 2080 is 25% faster than on a 3090 on a single GPU. We loaded 14 language models (around 4.7 GB in memory) on both GPUs.

    How can this be?

    We tested "int8" models with the "int8" and "float" settings, with beam_size 1 and 2. Same results: the 2080 is always faster than the 3090.

    2080: Driver Version: 460.32.03, CUDA Version: 11.2
    3090: Driver Version: 460.73.01, CUDA Version: 11.2

    Ubuntu 20.04

    Running in a Docker container.
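
    For context, a minimal sketch of the kind of call being compared in this benchmark (the model path and tokens are placeholders; compute_type and beam_size are the settings varied above):

    import ctranslate2

    # Load an int8-converted model, forcing the compute type under test.
    translator = ctranslate2.Translator("MODEL_DIR", device="cuda", compute_type="int8")

    # Hypothetical pre-tokenized input; beam_size is one of the compared settings.
    batch_tokens = [["▁Hello", "▁world"]]
    results = translator.translate_batch(batch_tokens, beam_size=2)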

    James
    @JOHW85
    (two screenshots attached)
    I was trying to continue training from the checkpoint at step 5000. OpenNMT-tf did show that it was loading the checkpoint, but when training resumed, the step count jumped back?
    I passed these parameters:
    --model_type TransformerBig --checkpoint_path /media/jeremy/test/model_v4_big/ --config v4_big.yml --auto_config train --with_eval
    Guillaume Klein
    @guillaumekln
    See the documentation on continuing a stopped training: https://opennmt.net/OpenNMT-tf/training.html#continuing-from-a-stopped-training. You should not set the --checkpoint_path option.
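    Concretely, continuing the run above would mean re-running the same command without --checkpoint_path (assuming model_dir in v4_big.yml points at the directory holding the checkpoints), e.g.:

    --model_type TransformerBig --config v4_big.yml --auto_config train --with_eval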
    alrudak
    @alrudak
    Does "-save_beam_to” option to visualise beams works now ?
    Guillaume Klein
    @guillaumekln
    This option is not implemented in OpenNMT-tf.
    alrudak
    @alrudak
    thanks