Following up on my message yesterday: the config.json for the downloaded YourTTS model appears to use a VITS model, and looking through the VITS source code, it supports d_vectors for speaker embeddings (more than a simple conditional variable), it can use a HiFi-GAN generator, and it has an option for SCL (Speaker Consistency Loss).
Am I assuming correctly that to train YourTTS I just train VITS with the specific settings outlined in the paper? :)
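Concretely, this is roughly what I was planning. The field names are my best reading of VitsArgs/VitsConfig (use_d_vector_file, use_speaker_encoder_as_loss, etc.), and the file paths are placeholders, so please treat this as a sketch rather than a verified recipe and correct me if I misread anything:

```python
# Sketch of a YourTTS-style VITS setup as I currently understand it.
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import VitsArgs

model_args = VitsArgs(
    use_d_vector_file=True,            # external speaker embeddings (d-vectors)
    d_vector_file="speakers.json",     # placeholder path to precomputed d-vectors
    d_vector_dim=512,
    use_speaker_encoder_as_loss=True,  # my reading of the SCL option
    speaker_encoder_model_path="SE_checkpoint.pth.tar",  # placeholder
    speaker_encoder_config_path="SE_config.json",        # placeholder
)

config = VitsConfig(
    model_args=model_args,
    run_name="yourtts_experiment",
    batch_size=32,
    # ... plus the text/audio settings from the paper
)
```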
Also, am I allowed to make variations on the VITS source code?
From what I've seen, the entire Coqui library is under the Mozilla Public License 2.0, and only the checkpoints are under the stricter CC BY-NC-ND 4.0 license. I intend to contribute/open-source any changes, but it's always good to know what's allowed.
I'm working on a server that can handle multiple concurrent requests.
Initializing the Synthesizer class takes a long time, so I initialize the class once and then reuse it for each request.
This works fine when the requests are made in series, but when I make several parallel requests I get errors such as:
"stack expects each tensor to be equal size, but got [1, 69] at entry 0 and [1, 71] at entry 2"
Is there any way to reduce the initialization time of the Synthesizer class and still avoid errors during parallel requests?
Package     Version
----------  -------
Cython      0.29.32
numpy       1.23.1
pip         22.2.2
pyworld     0.2.10
setuptools  41.2.0
wheel       0.37.1
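For reference, the shape of what I'm doing is roughly this (simplified; the checkpoint paths and the server framework are placeholders). Wrapping the calls in a lock avoids the tensor-size error, but it also serializes all requests, which is exactly what I'd like to get around:

```python
# Simplified sketch: one shared Synthesizer guarded by a lock so only one
# request runs inference at a time.
import threading
from TTS.utils.synthesizer import Synthesizer

synthesizer = Synthesizer(
    tts_checkpoint="model.pth",    # placeholder paths
    tts_config_path="config.json",
)
lock = threading.Lock()

def handle_request(text: str):
    # Serialize access to the shared model to avoid the batching/stack error.
    with lock:
        wav = synthesizer.tts(text)
    return wav
```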
Hi, I am using a custom ljspeech-based format, so I have been looking at the formatters.py file. For the ljspeech_test formatter I came across the following comment: "2 samples per speaker to avoid eval split issues".
What are the issues if there is an odd number of samples?
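For context, my custom formatter is essentially the ljspeech one with a different metadata layout. It looks roughly like this (the column layout, speaker name, and function name are specific to my data, so this is just illustrative):

```python
# Rough shape of my custom ljspeech-style formatter.
import os

def my_ljspeech_like(root_path, meta_file, **kwargs):
    items = []
    speaker_name = "my_speaker"  # single speaker in my dataset
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            cols = line.strip().split("|")
            wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")
            text = cols[1]
            items.append({
                "text": text,
                "audio_file": wav_file,
                "speaker_name": speaker_name,
                "root_path": root_path,
            })
    return items
```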