yutongli
@yutongli
Hi Guillaume, I ran the basic command to train a customized transfer model after upgrading to 2.9, but saw the following error: Use eager execution and:
tf.data.TFRecordDataset(path)
INFO:tensorflow:Accumulate gradients of 2 iterations to reach effective batch size of 25000
Traceback (most recent call last):
File "/conda/envs/simcloud/bin/onmt-main", line 8, in <module>
sys.exit(main())
File "/conda/envs/simcloud/lib/python3.6/site-packages/opennmt/bin/main.py", line 223, in main
hvd=hvd)
File "/conda/envs/simcloud/lib/python3.6/site-packages/opennmt/runner.py", line 205, in train
devices = misc.get_devices(count=num_devices)
File "/conda/envs/simcloud/lib/python3.6/site-packages/opennmt/utils/misc.py", line 34, in get_devices
count, len(devices), "is" if len(devices) == 1 else "are"))
ValueError: Requested 8 devices but only 1 is visible
I guess it's a hardware-related incompatibility issue? My GPU setup uses CUDA 10.0.
Guillaume Klein
@guillaumekln
TensorFlow 2.1 requires CUDA 10.1.
yutongli
@yutongli
:+1:
Michael A. Martin
@mmartin9684-sil

Is anyone else seeing a mecab install failure when installing OpenNMT-tf 2.9.1?

(nlp_mt_testing) C:\Users\mm9q\PycharmProjects\nlp_mt_testing>pipenv update
Running $ pipenv lock then $ pipenv sync.
Locking [dev-packages] dependencies…
Success!
Locking [packages] dependencies…
Success!
Updated Pipfile.lock (c017cc)!
Installing dependencies from Pipfile.lock (c017cc)…
An error occurred while installing mecab-python3==0.996.5 --hash=sha256:0758c4e428c9eda01b14f2e93b4b48055264c4044c43992a8421bfd2a27d9ae0 --hash=sha256:0e309f7f55c608b66c5dbd9a1e62f2686e9ff0923a5f
1c28406985c4d8a80549 --hash=sha256:1af08a46774ac219bf93cc0c52d87d5cbcce9f4c3abba6b14c374f81ebc718c5 --hash=sha256:1bfe39e3c6a5be7bf54d36fcc14c5938fc831960507f9d7337cd2cc0e8de07d3 --hash=sha256:2d
066afe09a95716facf60a673f538257f595a282beb1e021d1b495f995ee70e --hash=sha256:304ee365de78cf9c48373122bb68a5f2c2cb60c9d058a690b3cdaf1059d4c91c --hash=sha256:3187479c79151f384f44c0d12a87305bfab94be
f90183bd6db0bcd44c2c3374b --hash=sha256:32864543977281dcbb52f54e42d9fd060c9d869874d49806bbee5a0ff689e665 --hash=sha256:3f2a591460f28faf27318961c49e70b9e464a283c8184a677bdddc70aa8835b6 --hash=sha2
56:429c92effeb46336e994b2d4b29a8b9e57ff947df55fd5b8042315fdb50d573a --hash=sha256:4732677c74dae291587d930d307bfd92776a198cbb1ea2836ede652180c9dbd1 --hash=sha256:47ba0cf7b5b03137ba7b3998a6fe82da14
07ac577016df10b6da7f1979dfb0ca --hash=sha256:4fb35dd2486b3d8b1868cfa8e1772e195c26a90b02f2509e4b61e11fd846b442 --hash=sha256:52aa68cb5bbf993d3862feb9c477f230ed27c4cfc8e3fa3388419561cd02ffcb --hash
4601a31ffbca6524e328bec3ef098abff10 --hash=sha256:730898c5e55f6a5f894251603afd44ebfdbe7ddec00113c233ef1b9c91b8690a --hash=sha256:8d13b2732b3951a0defaa56acd628d1d83f123fdafee4f40701e001a66717a85 -
-hash=sha256:92019635b0686a6aa30f4d908155696077abe8848203e102ae7befab0b96b9ea --hash=sha256:9af5050a093172123be6eedf65f263fb936885c5c200b1808cb1d2f15a59e871 --hash=sha256:b33f212afff92fc292a8b20a
98693da8! Will try again.
================================ 120/120 - 00:01:13
Installing initially failed dependencies…
pipenv.exceptions.InstallError: File "c:\users\mm9q\appdata\local\programs\python\python37\lib\site-packages\pipenv\core.py", line 2611, in do_sync

pipenv.exceptions.InstallError: File "c:\users\mm9q\appdata\local\programs\python\python37\lib\site-packages\pipenv\core.py", line 1253, in do_init

pipenv.exceptions.InstallError: File "c:\users\mm9q\appdata\local\programs\python\python37\lib\site-packages\pipenv\core.py", line 859, in do_install_dependencies
pipenv.exceptions.InstallError: retry_list, procs, failed_deps_queue, requirements_dir, **install_kwargs
pipenv.exceptions.InstallError: File "c:\users\mm9q\appdata\local\programs\python\python37\lib\site-packages\pipenv\core.py", line 763, in batch_install
pipenv.exceptions.InstallError: _cleanup_procs(procs, not blocking, failed_deps_queue, retry=retry)
pipenv.exceptions.InstallError: File "c:\users\mm9q\appdata\local\programs\python\python37\lib\site-packages\pipenv\core.py", line 681, in _cleanup_procs
pipenv.exceptions.InstallError: raise exceptions.InstallError(c.dep.name, extra=err_lines)
[pipenv.exceptions.Install

Guillaume Klein
@guillaumekln
Is it the complete error log?
Michael A. Martin
@mmartin9684-sil
Yes, it is.
The only update in the Pipfile was the OpenNMT-tf package. The same error occurs with the most recent release (2.9.1), as well as with the prior 2.8.1 release.
Guillaume Klein
@guillaumekln
The package mecab-python3 is a new dependency of sacrebleu. I see that they don't publish packages for Windows, and this is likely why you are seeing an error. I will look into pinning sacrebleu to a previous version. In the meantime, you could try to install mecab-python3 manually.
Michael A. Martin
@mmartin9684-sil
Thank you for this feedback. It seems that sacrebleu 1.4.4 doesn't have the dependency on mecab-python3, so using that older release works around this issue. Many thanks!
arunnambiar27
@arunnambiar27
Can anyone help with how to create this Python program, OpenNMT-tf/third_party/learn_bpe.py? I am new to OpenNMT and trying out the default module.
Also, how can I give it my own data to translate?
Guillaume Klein
@guillaumekln
learn_bpe can be found here: https://github.com/rsennrich/subword-nmt/
arunnambiar27
@arunnambiar27
Thank you @guillaumekln. Can you specify which files have to be replaced?
Also, how do I use the pre-trained English-German dictionary model provided in OpenNMT-tf?
arunnambiar27
@arunnambiar27
How do I stop the training after some checkpoint?
Guillaume Klein
@guillaumekln
Looks like you have lots of questions. I suggest that you open a topic on the forum so that it is easier to answer.
yutongli
@yutongli
Hi @guillaumekln, could I ask a quick question about inference? I've trained a good transformer model, and previously I ran inference (about 23 million data points; batch size 64) against the model. The inference job went smoothly, though it took maybe 8-10 hours. Now I am running inference with a larger data set (150 million data points), and the job got killed after running for roughly 26 hours. Since inference just does sequential processing, why would a larger data set cause a crash after a longer running time? Is there anything specific that I should pay attention to for inference?
Guillaume Klein
@guillaumekln
Inference does not filter the data. So the first thing to check is whether you have very long sentences in your data, which can cause out-of-memory issues.
yutongli
@yutongli
@guillaumekln Thanks for your feedback. Actually, I normalized/filtered the data before inference, so the number of characters in each data point was between 3 and 70, inclusive. I checked the CPU and memory usage and found that for the inference job, the CPU stays at 95-110% usage and memory is always around 50%. Only a single inference job runs on the node (the node has a 32-core CPU, 8 GPUs, and 128 GB of memory). Any clue?
It seems that the inference is actually handled by the CPU only. Are we able to run inference with GPUs, like I did for the transformer model training with all 8 GPUs on the node?
yutongli
@yutongli
BTW, I also noticed that you mentioned in a closed topic that 'You could instead split your file and run separate inference processes to leverage multiple GPUs.' I also split the large data set into 6 pieces, so each contains about 25M data entries. But starting the 2nd inference job on the same node threw an exception and failed. Is there anything I missed that I should specify to leverage multiple GPUs for inference?
Renjith Sasidharan
@renjithsasidharan
Hello @guillaumekln, I wanted to ask about the effectiveness of a transformer model on a small dataset (200K). I have been training a small transformer model (1 layer, 512 dim, 4 heads). I am trying to extract the amount and date from OCR text from receipts, so the source sentences are very long (~500 words) and the target sentences are just one word. I have run the training for 100K iterations, but the loss seems very high (~1.5). Should I keep running it for longer? Is a transformer model as effective as an RNN on a small dataset like mine?
Guillaume Klein
@guillaumekln
@yutongli Most likely the inference is running on GPU otherwise you would have a higher CPU usage. If you want to run multiple inference jobs, you should restrict the GPU visibility for each process with CUDA_VISIBLE_DEVICES (you'll find more info on Google). As for the original issue (killed job), is the memory usage increasing?
@renjithsasidharan Hi. I see that you posted the same question on the forum. Let's continue the discussion there.
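To make the CUDA_VISIBLE_DEVICES suggestion above concrete, here is a minimal sketch of how one job per shard could be prepared, each restricted to a single GPU. The function name, the shard file names, the `.pred` suffix, and `config.yml` are hypothetical; the command line follows the form shown in the OpenNMT-tf 2.x quickstart and may need adjusting for your setup.

```python
import os


def build_inference_jobs(shards, num_gpus, config="config.yml"):
    """Return one (command, env) pair per shard, pinning each job to one GPU.

    Assumption: every shard is a plain-text features file and the
    predictions are written next to it with a ".pred" suffix.
    """
    jobs = []
    for index, shard in enumerate(shards):
        env = dict(os.environ)
        # Each process only sees the GPU listed here (round-robin assignment).
        env["CUDA_VISIBLE_DEVICES"] = str(index % num_gpus)
        command = [
            "onmt-main", "--config", config, "--auto_config", "infer",
            "--features_file", shard,
            "--predictions_file", shard + ".pred",
        ]
        jobs.append((command, env))
    return jobs
```

Each pair can then be started with `subprocess.Popen(command, env=env)`. If there are more shards than GPUs, the jobs should be launched in waves so that two processes do not share the same device at the same time.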
NL
@nslatysheva
Hey @guillaumekln, I'm interested in understanding translation errors made by trained models, specifically by (1) looking at attention weights from transformer heads and (2) finding training examples with similar hidden state vectors (I can compute similarity myself, just need to know how to access the raw numbers at different parts of the network). Any advice? :)
Guillaume Klein
@guillaumekln
You probably need to dive into the model code and place print statements where you need them. Just remember that the model is executed in graph mode, so you need to use the TensorFlow print function: https://www.tensorflow.org/api_docs/python/tf/print
NL
@nslatysheva
Thanks, will dive in :) Just curious, is there any overview/presentation/tutorial that serves as an intro to the code structure?
yutongli
@yutongli
@guillaumekln Thanks for getting back to me. I monitored the CPU and memory usage of the inference job for some time: the CPU is around 150% and memory is about 10%. Does this mean the job is running on the GPU? How high would the CPU usage be if the job were running on the CPU?
Guillaume Klein
@guillaumekln
You can use nvidia-smi to see processes running on the GPU. If it was running on the CPU, I think TensorFlow would be using all CPU cores by default.
@nslatysheva There is no such tutorial, but the code is not that big.
yutongli
@yutongli
@guillaumekln Thanks very much! After some research, I managed to make the inference job run only on CPUs by controlling GPU visibility via the NVIDIA CUDA environment variable. (Now the CPU usage shows ~2700%, rather than 150% previously. The GPU usage also remains at 0% according to monitoring.) However, the inference output (predictions) does not seem to be written gradually and incrementally. It seems that the job keeps working hard behind the scenes, holding output in memory for a very long time, without writing it out at a regular pace. (From what I observed, the regular writing happens in the last 2 hours before the job completes, given that the entire job takes about 30 hours.) Can we specify any parameters to control how output is written during inference? If so, would that speed up the overall processing?
Guillaume Klein
@guillaumekln
You can control this behavior but disabling it will actually make the overall decoding slower. See the parameter infer > length_bucket_width in https://opennmt.net/OpenNMT-tf/configuration.html. It is set to 5 with auto_config but you can disable it with 0.
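The effect described above can be illustrated with a small simulation. This is only a sketch of the general idea, not OpenNMT-tf's actual implementation: it assumes that inputs are grouped into buckets of similar length before decoding and that predictions are written back in the original input order. The function names and the example lengths are hypothetical.

```python
def bucketed_order(lengths, bucket_width):
    """Return input indices in the order they would be decoded.

    With bucket_width <= 0 the original order is kept; otherwise inputs are
    grouped by length bucket (length // bucket_width), shortest bucket first.
    """
    if bucket_width <= 0:
        return list(range(len(lengths)))
    return sorted(range(len(lengths)),
                  key=lambda i: (lengths[i] // bucket_width, i))


def flush_points(decode_order):
    """After each decoded input, count how many predictions can be written
    while still preserving the original input order."""
    done = set()
    next_to_write = 0
    written = []
    for index in decode_order:
        done.add(index)
        # A prediction can only be written once all earlier ones are ready.
        while next_to_write in done:
            next_to_write += 1
        written.append(next_to_write)
    return written


lengths = [12, 3, 25, 4, 7]  # hypothetical token counts per input line
print(flush_points(bucketed_order(lengths, 5)))  # [0, 0, 0, 2, 5]
print(flush_points(bucketed_order(lengths, 0)))  # [1, 2, 3, 4, 5]
```

With a bucket width of 5, nothing can be written until the first input (a long one) has been decoded, so output arrives late and in bursts. With a width of 0, output is written steadily, but batches then mix short and long inputs, which is why the overall decoding becomes slower.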
Sirogha
@Sirogha
Hello. I am trying to train an en-ru model with SentencePiece mode.
When I finished building the vocab with BPE mode, I didn't find the letter Z. It's strange, because this letter appears more than 2 million times in my source. How can that be?
Memduh Gökırmak
@MemduhG
I'm getting this error when I try to run onmt-build-vocab:
AttributeError: module 'tensorflow_core._api.v2.random' has no attribute 'Generator'
Guillaume Klein
@guillaumekln
@MemduhG What TensorFlow version do you have installed?
@Sirogha How did you look for the letter in the vocabulary?
yutongli
@yutongli
@guillaumekln I have trained a transformer model using OpenNMT-tf and want to serve it in production for real-time inference. I am considering https://github.com/OpenNMT/CTranslate2. Is Intel MKL the minimum requirement for building CTranslate2? If so, and we end up not being able to have CTranslate2 in the production environment, could you please advise on anything else? All I want is to bring the trained transformer model into the production environment, so anything that can improve real-time inference would be very helpful! Thanks
yutongli
@yutongli
BTW, the production environment is C++.
Guillaume Klein
@guillaumekln
Yes, CTranslate2 only requires Intel MKL for CPU translation. It seems to be exactly what you need.
yutongli
@yutongli
Thanks!
Soumya Chennabasavaraj
@soumyacbr
I have trained a transformer model, and now I'm running inference. But the inference gets stuck after translating some 20 sentences. What could be the problem? Has anyone faced this? It does not even throw an error. It's just stuck after translating the 20th sentence.
Guillaume Klein
@guillaumekln
You should probably just let it run. The test file is reordered internally to increase efficiency.
Soumya Chennabasavaraj
@soumyacbr
Yes, I left it running and it eventually finished. Thanks
Anna Samiotou
Hello, does OpenNMT-tf support protected sequences/placeholders, i.e. ｟URL：http://www.opennmt.net｠, as described in https://opennmt.net/OpenNMT/tools/tokenization/#special-characters? This assumes that SP/BPE or unigram is applied through the OpenNMT Tokenizer. Thanks in advance
Guillaume Klein
@guillaumekln
Hi, you would need to remove the value part (：http://www.opennmt.net in this example) before calling OpenNMT-tf. The remaining part ｟URL｠ will be treated like any other token during training/inference.
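As a rough illustration of that preprocessing step, the sketch below strips the value part from each placeholder and keeps the removed values so they can be put back after translation. The function name and the regular expression are assumptions made for this example, not part of OpenNMT-tf or the OpenNMT Tokenizer.

```python
import re

# Matches ｟NAME：value｠ and captures the name and the value separately.
PLACEHOLDER = re.compile(r"｟([^：｠]+)：([^｠]*)｠")


def strip_placeholder_values(text):
    """Replace every ｟NAME：value｠ with ｟NAME｠ and return the removed values."""
    values = []

    def _replace(match):
        values.append(match.group(2))
        return "｟" + match.group(1) + "｠"

    return PLACEHOLDER.sub(_replace, text), values


stripped, values = strip_placeholder_values(
    "Visit ｟URL：http://www.opennmt.net｠ today")
print(stripped)  # Visit ｟URL｠ today
print(values)    # ['http://www.opennmt.net']
```

Placeholders that already have no value part, such as ｟URL｠, are left unchanged by this pattern.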
Anna Samiotou