    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    not really sure how different finetuning would be
    Rolun
    @Rolun
    Awesome thanks, that's a great lead! :pray:
    Túlio Chiodi
    @tuliochiodi:matrix.org
    [m]
    After I start my training and launch tensorboard I can visualize the audios from train, valid and test. Is there a reason why my train and valid audios are always 0.5 seconds? Can I change it?
    omkarade
    @omkarade
    @weberjulian:matrix.org https://github.com/coqui-ai/TTS/discussions/1837
    i need help
    josh ๐Ÿธ
    @josh-coqui:matrix.org
    [m]
    just saw this new intro video to Coqui TTS :)
    Rolun
    @Rolun

    Following up on my message yesterday, the config.json for the downloaded YourTTS seems to be using a VITS model, and when looking in the VITS source code it has d_vectors which allows for speaker embeddings (more than a simple conditional variable), it can use a HiFiGan generator and has the option for SCL (Speaker Consistency Loss).
    Am I assuming correctly that to train YourTTS I just train VITS with the specific settings outlined in the paper? :)

    Also, am I allowed to make variations on the VITS source code?
    From what I've seen, the entire Coqui library is under the Mozilla Public License 2.0 and only the checkpoints are under the stricter CC BY-NC-ND 4.0 license. I intend to contribute/open-source any changes, but it's always good to know what's allowed.

    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    @Rolun: I think there are other changes that are not in the config
    Rolun
    @Rolun
    Oh okay, so it wouldn't end up the same as the public "tts_models--multilingual--multi-dataset--your_tts"? :/
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    The paper talks about quite a few changes, so I don't think you will reproduce the same results with just a config change or a recipe for the dataset preparation.
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    hmm, I'm reading the paper again, and the main difference is the speaker consistency loss
    and glowtts
    Rolun
    @Rolun
    I seem to be able to find most of the changes, but haven't gone through all yet:
    "embedded_language_dim": 4,
    "num_layers_text_encoder": 10,
    "hidden_channels": 192, (not quite sure why this one differs though)
    "use_speaker_encoder_as_loss": true,
    from TTS.vocoder.models.hifigan_generator import HifiganGenerator (imported in the source file)
    "num_layers_posterior_encoder": 16,
    The source code refers to graphemes instead of phonemes
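Gathered into one fragment (a sketch only: the key names are copied verbatim from the messages above, and whether these values fully reproduce the released YourTTS checkpoint is unverified), the overrides identified so far would look roughly like:

```python
# Hypothetical collection of the VITS config overrides mentioned in the chat.
# This is not the official YourTTS recipe, just the settings found so far.
yourtts_overrides = {
    "embedded_language_dim": 4,            # language embedding size
    "num_layers_text_encoder": 10,         # deeper text encoder than stock VITS
    "hidden_channels": 192,
    "num_layers_posterior_encoder": 16,
    "use_speaker_encoder_as_loss": True,   # Speaker Consistency Loss (SCL)
}
```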
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    which is supported in coqui as well
    yes and they add a language identifier to the graphemes
    Rolun
    @Rolun
    "and the main difference is the speaker consistency loss" (don't know how to quote properly in Gitter :sweat_smile: )
    speaker consistency loss is renamed use_speaker_encoder_as_loss in the source code
    But the formula may differ
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    glowtts is in vits too
    Rolun
    @Rolun
    You mean it's in the same source code? :O
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    coqui has glowtts
    Rolun
    @Rolun
    Ah yes, I have seen that one, but how do you mean that glowtts is in vits as well? (Sorry I think I'm missing something obvious)
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    sorry i think i was wrong
    i confused multiple papers
    ignore me 🙂
    Rolun
    @Rolun
    Haha np, been there done that :sweat_smile:
    But to conclude, it looks like VITS code & config ⊆ [VITS, YourTTS] models, but will keep looking if something is missing :)
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    looks like it's close anyway
    Rolun
    @Rolun
    :thumbsup: :smiley: And thanks for taking a look with me @joachim.vanheuverzwijn:matrix.zoiper.com !
    Akmal
    @Wikidepia
    I want to train a non-English multispeaker model. Do I need to train the speaker encoder from scratch, or can I use the English pretrained speaker encoder?
    Túlio Chiodi
    @tuliochiodi:matrix.org
    [m]
    Hello everyone! How can I switch between CPU and GPU audio synthesis? Can I do it using CLI?
    erogol ๐Ÿธ
    @golero:matrix.org
    [m]
    There should be a use_cuda flag on the CLI
    Túlio Chiodi
    @tuliochiodi:matrix.org
    [m]
    Thanks erogol, I'll take a look!
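For reference, a minimal invocation might look like the following. The flag spelling `--use_cuda` and the model name here are assumptions based on the exchange above, so check `tts --help` against your installed version:

```shell
# Synthesize on GPU; set --use_cuda false (or omit the flag) for CPU.
tts --text "Hello world." \
    --model_name "tts_models/en/ljspeech/vits" \
    --use_cuda true \
    --out_path out.wav
```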
    AlexBlack772
    @AlexBlack772

    I'm working on a server that can handle multiple concurrent requests.

    Initializing the Synthesizer class takes a long time, so I initialize the class once and then reuse it for each request.

    This works fine if the requests are made in series. But when I make several parallel requests, I get errors

    "stack expects each tensor to be equal size, but got [1, 69] at entry 0 and [1, 71] at entry 2"

    Is there any way to reduce time of initialization of Synthesizer class and still avoid errors during parallel requests?
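One common workaround for errors like the one above is to keep the single pre-initialized instance but serialize access to it with a lock. This is a sketch, not Coqui-specific guidance: `SafeSynthesizer` is a hypothetical wrapper, and the inner `tts()` call stands in for whatever inference method the shared object exposes.

```python
import threading

class SafeSynthesizer:
    """Wrap a shared, non-thread-safe synthesizer so concurrent requests
    queue up instead of interleaving tensors inside one inference call."""

    def __init__(self, synthesizer):
        self._synthesizer = synthesizer  # initialized once, reused forever
        self._lock = threading.Lock()

    def tts(self, text):
        # Only one request runs inference at a time; the expensive
        # initialization cost is still paid only once.
        with self._lock:
            return self._synthesizer.tts(text)
```

This trades parallel throughput for correctness; for true parallelism, a pool of independently initialized synthesizer instances (one per worker) avoids both the error and the lock.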

    swissmontreux
    @swissmontreux:matrix.org
    [m]
    I have been through this many times and it will not install on Windows. The first issue was MS C++ build tools not being installed; correcting that fixed one problem, but now there are others. I will look to use Colab, but I wouldn't mind getting this running in the home office to utilise the multiple GPUs I already have.
    sanjaesc
    @sanjaesc:matrix.org
    [m]
    cython is installed?
    swissmontreux
    @swissmontreux:matrix.org
    [m]
    yeah first thing I checked as the error above does imply that it isn't
    sanjaesc
    @sanjaesc:matrix.org
    [m]
    how did you check?
    swissmontreux
    @swissmontreux:matrix.org
    [m]
    pip list:

        Package    Version
        ---------- -------
        Cython     0.29.32
        numpy      1.23.1
        pip        22.2.2
        pyworld    0.2.10
        setuptools 41.2.0
        wheel      0.37.1

    swissmontreux
    @swissmontreux:matrix.org
    [m]
    As always in life, you spend many hours trying to fix a problem, then post the issue online, keep looking at it... and you fix it. The Windows install documentation is misleading, to be honest; the problem was around where pip was installing the modules versus running the TTS install via .\scripts\pip install -e . There was also the issue of the MS C++ build tools missing, or at least the correct version of them. So I now have Windows training a model with an oldish GPU and it's flying!
    CaraDuf
    @Ca-ressemble-a-du-fake

    Hi, I am using a custom LJSpeech-based format and so am looking at the formatters.py file. For the ljspeech_test formatter I encountered the following comment: "2 samples per speaker to avoid eval split issues".

    What are the issues if there is an odd number of samples?

    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    isn't 2 samples a minimum?
    so that there is one to put in validation (just thinking out loud)
    wannaphong
    @wannaphong:matrix.org
    [m]
    Can someone share a link to a speaker encoder trained on CommonVoice with all languages?
    sanjaesc
    @sanjaesc:matrix.org
    [m]
    It just increases the speaker Id every 2 samples
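The behavior described above can be sketched like this. It is a guess at the logic, not the exact code in formatters.py: incrementing a synthetic speaker id every two samples guarantees each "speaker" can contribute a sample to both the train and eval splits.

```python
# Hypothetical sketch of the ljspeech_test speaker-id assignment: every two
# consecutive samples share one synthetic speaker name, so an eval split by
# speaker never ends up with a speaker that has no training sample.
# The real formatter may differ in detail.
def speaker_name_for(idx: int) -> str:
    return f"ljspeech-{idx // 2}"
```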
    ivan provalov
    @iprovalov:matrix.org
    [m]
    I am seeing this message: ModuleNotFoundError: [!] Config for vits cannot be found.
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    Does anybody know why d-vectors are used in YourTTS and not x-vectors?
    Is it for practical reasons (Mozilla had d-vectors in the code), or is there a different reason?
    erogol ๐Ÿธ
    @golero:matrix.org
    [m]
    We always used d-vectors and x-vectors the same way; we just renamed them at some point.