Hi! First, I want to say a big thank you to Deezer's team for publishing their work. It's great to see such technology made available, and I hope it will lead many companies to do the same.
I'm developing a C++ port of spleeter as a side project (right here: https://github.com/gvne/spleeterpp). My goal is to give people the opportunity to use your technology within plug-ins.
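For reference, here is roughly what the inference side of the port looks like with the TensorFlow C++ API. This is just a sketch: the export path and the tensor names `waveform` / `vocals` / `accompaniment` are placeholders, since the real names depend on how the checkpoint was converted to a SavedModel.

```cpp
#include <iostream>
#include <vector>

#include "tensorflow/cc/saved_model/loader.h"
#include "tensorflow/cc/saved_model/tag_constants.h"
#include "tensorflow/core/framework/tensor.h"

int main() {
  // Load the exported 2-stems model (path is a placeholder).
  tensorflow::SavedModelBundle bundle;
  tensorflow::Status status = tensorflow::LoadSavedModel(
      tensorflow::SessionOptions(), tensorflow::RunOptions(),
      "exported_models/2stems", {tensorflow::kSavedModelTagServe}, &bundle);
  if (!status.ok()) {
    std::cerr << status.ToString() << std::endl;
    return 1;
  }

  // A stereo waveform tensor: [samples, channels]. Filled with silence here;
  // a real caller would copy decoded audio into it.
  tensorflow::Tensor waveform(tensorflow::DT_FLOAT,
                              tensorflow::TensorShape({44100, 2}));
  waveform.flat<float>().setZero();

  // NOTE: "waveform", "vocals" and "accompaniment" are placeholder tensor
  // names, not the actual exported signature.
  std::vector<tensorflow::Tensor> outputs;
  status = bundle.session->Run({{"waveform", waveform}},
                               {"vocals", "accompaniment"}, {}, &outputs);
  if (!status.ok()) {
    std::cerr << status.ToString() << std::endl;
    return 1;
  }
  std::cout << "vocals shape: " << outputs[0].shape().DebugString()
            << std::endl;
  return 0;
}
```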
At the moment, I've been able to run the pre-trained models with waveforms as input/output through TensorFlow. My next step is to use them with an STFT frame as input and a mask as output. I did so (almost) successfully, but realized that performance (both CPU- and quality-wise) is much worse. Digging into your code a bit more, I came across your 'T' parameter (https://github.com/deezer/spleeter/wiki/3.-Models#audio-parameters). If I understand it properly, it means that the current pre-trained models are meant to process batches of 512 frames (~12 sec of audio), right?
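To put a number on that ~12 sec figure (assuming the default frame_step of 1024 samples at 44.1 kHz listed on the wiki page above):

```cpp
#include <iostream>

int main() {
  // Audio parameters from the wiki page above (defaults, if I read it right).
  const int T = 512;                   // STFT frames per batch
  const int frame_step = 1024;         // hop size in samples
  const double sample_rate = 44100.0;  // Hz

  // One batch spans T hops of the STFT:
  const double batch_seconds = T * frame_step / sample_rate;
  std::cout << "One batch covers ~" << batch_seconds << " s of audio"
            << std::endl;  // ~11.9 s
  return 0;
}
```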
If I'm correct, you can imagine that such a latency in a plug-in isn't acceptable :).
So, my questions would be:

- Is my understanding of the 'T' parameter correct?
- Is there a way to use the pre-trained models with a smaller value of T, or would that require retraining?
Anyway, thank you once again for your great work!
Hi all, this looks like an amazing tool. However, I am looking for something that could clean up audio for a podcast, i.e. recordings that sound like they were made in a submarine.
Is there any way to just separate out clean audio if there is no accompaniment music? And, the holy grail, could it separate out two different speakers into separate channels?
I am guessing the available stems don't allow this, but if the network were trained for it, would it be possible? If not, why not? And is anyone else doing source separation for spoken audio?