    josh 🐸
    @josh-coqui:matrix.org
    [m]

    Text-driven Emotional Style Control and Cross-speaker Style Transfer in Neural TTS

    Expressive text-to-speech has shown improved performance in recent years. However, the style control of synthetic speech is often restricted to discrete emotion categories and requires training data recorded by the target speaker in the target style. In many practical situations, users may not have reference speech recorded in the target emotion but may still want to control the speech style simply by typing a text description of the desired emotional style. In this paper, we propose a text-based interface for emotional style control and cross-speaker style transfer in multi-speaker TTS. We propose a bi-modal style encoder that models the semantic relationship between the text description embedding and the speech style embedding with a pretrained language model. To further improve cross-speaker style transfer on disjoint, multi-style datasets, we propose a novel style loss. The experimental results show that our model can generate high-quality expressive speech even in unseen styles.
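    A rough PyTorch sketch of the bi-modal style encoder idea (my own toy, not the paper's code; the dimensions, the GRU reference encoder, and the cosine alignment loss are all assumptions):

```python
# Minimal sketch (not the paper's code): align a text-description embedding from
# a frozen, pretrained language model with a speech style embedding so that a
# style can later be selected just by typing a description. All dimensions,
# the GRU reference encoder, and the cosine loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiModalStyleEncoder(nn.Module):
    def __init__(self, lm_dim=768, mel_dim=80, style_dim=128):
        super().__init__()
        # Projects the pretrained LM's sentence embedding into the style space.
        self.text_proj = nn.Linear(lm_dim, style_dim)
        # Simple reference encoder over mel frames (GRU kept small for brevity).
        self.ref_encoder = nn.GRU(mel_dim, style_dim, batch_first=True)

    def forward(self, lm_embedding, mel):
        text_style = self.text_proj(lm_embedding)   # (B, style_dim)
        _, h = self.ref_encoder(mel)                # h: (1, B, style_dim)
        speech_style = h.squeeze(0)
        return text_style, speech_style

def style_alignment_loss(text_style, speech_style):
    # Pull the two embeddings of the same utterance together (cosine distance).
    return 1.0 - F.cosine_similarity(text_style, speech_style, dim=-1).mean()

# Toy usage with random tensors standing in for real LM outputs / mel features.
enc = BiModalStyleEncoder()
lm_emb = torch.randn(4, 768)    # e.g. pooled output of a frozen BERT-style model
mel = torch.randn(4, 200, 80)   # 200 mel frames per utterance
loss = style_alignment_loss(*enc(lm_emb, mel))
```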
    weberjulian 🐸: Edresson 🐸 erogol 🐸 the samples are really good! ^^^
    josh 🐸
    @josh-coqui:matrix.org
    [m]
    hey there pdav 👋😄
    pdav
    @pdav:matrix.org
    [m]
    Hi!
    I like reading the papers :)
    josh 🐸
    @josh-coqui:matrix.org
    [m]

    DelightfulTTS 2

    Current text to speech (TTS) systems usually leverage a cascaded acoustic model and vocoder pipeline with mel spectrograms as the intermediate representations, which suffers from two limitations: 1) the acoustic model and vocoder are separately trained instead of jointly optimized, which incurs cascaded errors; 2) the intermediate speech representations (e.g., mel-spectrograms) are pre-designed and lose phase information, which is sub-optimal. To solve these problems, in this paper we develop DelightfulTTS 2, a new end-to-end speech synthesis system with automatically learned speech representations and a jointly optimized acoustic model and vocoder. Specifically, 1) we propose a new codec network based on vector-quantized auto-encoders with adversarial training (VQ-GAN) to extract intermediate frame-level speech representations (instead of traditional representations like mel-spectrograms) and reconstruct the speech waveform; 2) we jointly optimize the acoustic model (based on DelightfulTTS) and the vocoder (the decoder of the VQ-GAN), with an auxiliary loss on the acoustic model to predict the intermediate speech representations. Experiments show that DelightfulTTS 2 achieves a CMOS gain of +0.14 over DelightfulTTS, and further method analyses verify the effectiveness of the developed system.

    https://cognitivespeech.github.io/delightfultts2
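    For context, the VQ bottleneck at the heart of a VQ-GAN codec looks roughly like this (an illustrative sketch with made-up sizes and standard VQ-VAE losses, not the DelightfulTTS 2 implementation):

```python
# Rough sketch of the vector-quantization bottleneck behind a VQ-GAN codec:
# frame-level features get snapped to the nearest codebook entry, and the
# straight-through estimator lets gradients flow back to the encoder.
# Illustrative only; sizes and losses are standard VQ-VAE choices, not the
# DelightfulTTS 2 implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta

    def forward(self, z):                                  # z: (B, T, code_dim)
        flat = z.reshape(-1, z.shape[-1])
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(idx).view_as(z)
        # Codebook + commitment terms from the standard VQ-VAE objective.
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through estimator: copy gradients from z_q back to z.
        z_q = z + (z_q - z).detach()
        return z_q, vq_loss

vq = VectorQuantizer()
frames = torch.randn(2, 100, 256)        # stand-in frame-level encoder outputs
quantized, loss = vq(frames)             # quantized frames go to the decoder/vocoder
```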

    haha there's a new one :D
    nice samples!
    Joachim Vanheuverzwijn
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    can't really say it's TTS (with my laptop speakers)
    erogol 🐸
    @golero:matrix.org
    [m]
    Keyword -> “internal dataset” :)
    Joachim Vanheuverzwijn
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    yes, neatly recorded and manually cut, most likely
    josh 🐸
    @josh-coqui:matrix.org
    [m]

    Style Transfer of Audio Effects with Differentiable Signal Processing

    We present a framework that can impose the audio effects and production style from one recording to another by example with the goal of simplifying the audio production process. We train a deep neural network to analyze an input recording and a style reference recording, and predict the control parameters of audio effects used to render the output. In contrast to past work, we integrate audio effects as differentiable operators in our framework, perform backpropagation through audio effects, and optimize end-to-end using an audio-domain loss. We use a self-supervised training strategy enabling automatic control of audio effects without the use of any labeled or paired training data. We survey a range of existing and new approaches for differentiable signal processing, showing how each can be integrated into our framework while discussing their trade-offs. We evaluate our approach on both speech and music tasks, demonstrating that our approach generalizes both to unseen recordings and even to sample rates different than those seen during training. Our approach produces convincing production style transfer results with the ability to transform input recordings to produced recordings, yielding audio effect control parameters that enable interpretability and user interaction.

    https://csteinmetz1.github.io/DeepAFx-ST/
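    The core trick, audio effects as differentiable operators trained with an audio-domain loss, can be illustrated with a toy gain stage (my own sketch; DeepAFx-ST itself uses a much richer effect chain and features):

```python
# Toy illustration (mine, not DeepAFx-ST) of the core trick: an audio effect as a
# differentiable operator, so a controller network can be trained end-to-end with
# an audio-domain loss and no parameter labels. A single gain stage stands in for
# the real EQ/compressor chain.
import torch
import torch.nn as nn
import torch.nn.functional as F

def diff_gain(x, gain_db):
    # Differentiable gain: dB -> linear scale, applied to the waveform.
    return x * torch.pow(10.0, gain_db / 20.0)

class TinyController(nn.Module):
    """Predicts effect parameters from crude features of (input, reference)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, x, ref):
        feats = torch.stack([x.pow(2).mean(-1), ref.pow(2).mean(-1)], dim=-1)
        return self.net(feats).squeeze(-1)   # predicted gain in dB

x = torch.randn(8, 16000)        # input recordings (1 s at 16 kHz)
ref = 2.0 * x                    # "style" reference: same audio, just louder
ctrl = TinyController()
opt = torch.optim.Adam(ctrl.parameters(), lr=1e-2)
for _ in range(100):
    y = diff_gain(x, ctrl(x, ref).unsqueeze(-1))
    loss = F.mse_loss(y, ref)    # audio-domain loss; gradients flow through the effect
    opt.zero_grad()
    loss.backward()
    opt.step()
```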

    the samples for voice are good, but I wish they used different sentences for source/ref
    josh 🐸
    @josh-coqui:matrix.org
    [m]
    We propose Guided-TTS, a high-quality text-to-speech (TTS) model that does not require any transcript of the target speaker, using classifier guidance. Guided-TTS combines an unconditional diffusion probabilistic model with a separately trained phoneme classifier for classifier guidance. Our unconditional diffusion model learns to generate speech without any context from untranscribed speech data. For TTS synthesis, we guide the generative process of the diffusion model with a phoneme classifier trained on a large-scale speech recognition dataset. We present a norm-based scaling method that reduces the pronunciation errors of classifier guidance in Guided-TTS. We show that Guided-TTS achieves performance comparable to that of the state-of-the-art TTS model Grad-TTS, without any transcript for LJSpeech. We further demonstrate that Guided-TTS performs well on diverse datasets, including a long-form untranscribed dataset.
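    The classifier-guidance-with-norm-based-scaling idea, as a toy sketch (the update rule and names here are my reading of the abstract, not Guided-TTS's actual code):

```python
# Toy sketch of classifier guidance with norm-based scaling, as I read the
# abstract: the unconditional diffusion score is nudged by the gradient of a
# phoneme classifier, rescaled so its norm matches the score's norm. The names
# and the exact update are illustrative, not Guided-TTS's actual code.
import torch

def guided_score(score_fn, classifier_logp_fn, x_t, t, phonemes, scale=1.0):
    s = score_fn(x_t, t)                                  # unconditional score
    x = x_t.detach().requires_grad_(True)
    logp = classifier_logp_fn(x, t, phonemes).sum()       # log p(phonemes | x_t)
    g = torch.autograd.grad(logp, x)[0]
    g = g * (s.norm() / (g.norm() + 1e-8))                # norm-based scaling
    return s + scale * g

# Stand-in callables so the sketch runs on its own.
score_fn = lambda x, t: -x
classifier_logp_fn = lambda x, t, ph: -(x - ph).pow(2).sum(dim=-1)
x_t, phonemes = torch.randn(2, 80), torch.randn(2, 80)
guided = guided_score(score_fn, classifier_logp_fn, x_t, torch.tensor(0.5), phonemes)
```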

    we guide the generative process of the diffusion model with a phoneme classifier trained on a large-scale speech recognition dataset

    this sounds pretty equivalent to running STT over the audio and using the transcripts for training... 🤔

    I would guess that "phoneme classifier" is worse than just using STT
    skeilnet
    @skeilnet:mozilla.org
    [m]
    I like the intonations even if it's still a tad robotic compared to Grad-TTS, for instance
    josh 🐸
    @josh-coqui:matrix.org
    [m]
    looks like their approach beats the STT-based Grad-TTS-ASR by a small margin
    spectie
    @spectie:matrix.org
    [m]
    So it's not significant
    erogol 🐸
    @golero:matrix.org
    [m]

    Exploring Disentanglement with Multilingual and Monolingual VQ-VAE

    This work examines the content and usefulness of disentangled phone and speaker representations from two separately trained VQ-VAE systems: one trained on multilingual data and another trained on monolingual data. We explore the multi- and monolingual models using four small proof-of-concept tasks: copy-synthesis, voice transformation, linguistic code-switching, and content-based privacy masking. From these tasks, we reflect on how disentangled phone and speaker representations can be used to manipulate speech in a meaningful way. Our experiments demonstrate that the VQ representations are suitable for these tasks, including creating new voices by mixing speaker representations together. We also present our novel technique to conceal the content of targeted words within an utterance by manipulating phone VQ codes, while retaining speaker identity and intelligibility of surrounding words. Finally, we discuss recommendations for further increasing the viability of disentangled representations.

    https://arxiv.org/pdf/2105.01573.pdf
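    The two manipulations the paper plays with, mixing speaker embeddings and overwriting phone VQ codes, reduce to something like this (illustrative shapes and names, not the authors' code):

```python
# Toy sketch of the two manipulations, with made-up shapes (not the authors' code):
# (1) mixing speaker representations to create a new voice, and (2) masking the
# content of selected frames by overwriting their phone VQ codes.
import torch

def mix_speakers(spk_a, spk_b, alpha=0.5):
    # New "voice" as an interpolation of two speaker embeddings.
    return alpha * spk_a + (1.0 - alpha) * spk_b

def mask_phone_codes(phone_codes, start, end, mask_code=0):
    # Replace the phone VQ indices of a targeted span with a neutral code,
    # leaving the rest of the utterance (and the speaker embedding) untouched.
    masked = phone_codes.clone()
    masked[start:end] = mask_code
    return masked

phone_codes = torch.randint(0, 256, (120,))    # one VQ code index per frame
spk_a, spk_b = torch.randn(64), torch.randn(64)
new_voice = mix_speakers(spk_a, spk_b, alpha=0.3)
private = mask_phone_codes(phone_codes, start=40, end=55)
```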

    p0p4k
    @p0p4k:matrix.org
    [m]
    Hello, when doing incremental learning, we need the old data plus the new data, right? If not, the model will forget the old data (not completely, but partially, because the new data might contain data similar to the old) or learn to give different responses for the older text. Am I right?
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    wrong room 🙂 join the coqui-ai/TTS room
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    Do you have an idea how fast it forgets?
    For ASR, it's really fast; for TTS, no idea
    p0p4k
    @p0p4k:matrix.org
    [m]
    Yes, I'll get to the point.
    What if we do not have the old training data and want to train on new data? Is it okay to sample (a lot of samples) from the current model, add them to the new training data, and then train? Does that mean we can retain the older behaviour too?
    Is there any paper like that? Because this could be general ML stuff.
    weberjulian 🐸
    @weberjulian:matrix.org
    [m]
    Hmm yeah, that might work, but you might lose some quality from training on generated data. Not sure it's an active field of research, since the obvious answer would be to keep training with the old data.
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    You would have mistakes that get worse with every iteration (only one iteration in this case, maybe, so it might not be too bad)
    For ASR, training on pseudo-labeling is similar; it's usually done with an 80/20 split: 80% pseudo-labeled, 20% ground truth
    it would make more sense to use the original data in this case, I think
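    Just to make that 80/20 rule of thumb concrete, one (hypothetical) way to assemble such a mixed epoch:

```python
# Hypothetical sketch of how such a mixed epoch could be assembled; the 80/20
# ratio is just the rule of thumb mentioned above, not a recipe from any paper.
import random

def mixed_epoch(pseudo, ground_truth, pseudo_frac=0.8, epoch_size=10000):
    n_pseudo = int(pseudo_frac * epoch_size)
    batch = random.choices(pseudo, k=n_pseudo) + \
            random.choices(ground_truth, k=epoch_size - n_pseudo)
    random.shuffle(batch)
    return batch

epoch = mixed_epoch(pseudo=["utt_p1", "utt_p2"], ground_truth=["utt_g1"], epoch_size=10)
```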
    p0p4k
    @p0p4k:matrix.org
    [m]
    Ok thanks, sorry for posting like this out of the blue.
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    better ask in the other room so that this one stays easy to navigate for new papers 🙂 (my 2 cents; I'm not affiliated with Coqui, so my opinion is probably irrelevant)
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    do you mean the broken link at the bottom?
    the samples here sound ok? https://singgan.github.io/
    hmm, they do sound lower-pitched indeed
    josh 🐸
    @josh-coqui:matrix.org
    [m]
    Speech Synthesis with Mixed Emotions

    Emotional speech synthesis aims to synthesize human voices with various emotional effects. The current studies are mostly focused on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference between the speech samples of different emotions. We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework. During training, the framework not only explicitly characterizes emotion styles, but also explores the ordinal nature of emotions by quantifying the differences with other emotions. At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector. The objective and subjective evaluations have validated the effectiveness of the proposed framework. To the best of our knowledge, this is the first study on modelling, synthesizing and evaluating mixed emotions in speech.
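    At run-time, the emotion attribute vector amounts to something like this (an illustrative sketch; the emotion set and the projection layer are assumptions, not the paper's model):

```python
# Illustrative only: conditioning a TTS model on a manually defined emotion
# attribute vector at run-time, e.g. a blend of "happy" and "surprised".
# The emotion set and the projection layer are assumptions, not the paper's model.
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprised"]

class EmotionConditioner(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        self.proj = nn.Linear(len(EMOTIONS), emb_dim)

    def forward(self, attribute_vector):
        # attribute_vector: per-emotion intensities in [0, 1], mixed freely.
        return self.proj(attribute_vector)

cond = EmotionConditioner()
mix = torch.tensor([0.0, 0.6, 0.0, 0.0, 0.4])   # 60% happy + 40% surprised
emotion_embedding = cond(mix)                    # fed to the acoustic model
```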

    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    seems to be with an LM trained on Common Voice as well, though
    Joachim
    @joachim.vanheuverzwijn:matrix.zoiper.com
    [m]
    :point_up: Edit: I spoke with the author; the Common Voice test set was excluded from the LM training