What if the next silence isn't right after the next word, but a few words later, when the speaker actually pauses for a second?
To be sure, though, I'll study the YT subtitle file a bit more closely to try to figure out how the timestamps are determined, and whether they line up with when the speaker actually pauses for a second before continuing to speak.
Thanks for suggesting to split at the next silence after the timestamp tho, Ciaran!
I did a quick test and the aligned.json output looked like gibberish. I ran $ bin/align.sh --audio data/test1/audio.wav --script data/test1/transcript.txt --aligned data/test1/aligned.json --tlog data/test1/transcript.log (substituting my own files and directories).
I assume I need to configure it to work for Spanish (the provided script seems to be set up for English only).
But thanks for confirming this is it. From your explanation above, I'd assumed it was a tool where I could supply the .wav audio file and the transcript and manually align them before it splits.