captions are submitted after the initial transcript is created, which means the text submitted is the entire stream of text between one speaker and the next...
i.e. <speaker1: this whole block of text will be sent as a single block for additional punctuation>
<speaker2: this whole block of text will be sent next>
Well, if you want to go that far, there are some programs I've used that can remove such extraneous "noise", and do quite an amazing job at it. The best ones, however, tend to be quite expensive, and they generally mute/muffle based on a noise print, as opposed to outright cutting text.
For longer words, or even filler phrases, that also starts to enter much more subjective territory. Either way, it still makes the case for pre-filtering the audio tracks before processing to remove as much of this extraneous "noise" as possible. It seems you may have also missed that aspect of my post....
Edit the resulting file by removing redundant words
Why don't you remove all the extraneous noise and filler words using a multitrack audio editor over all tracks simultaneously before running it through speech-to-text?
Well, it just makes no sense to prefilter when, with speech-to-text, you can see where the actual words are in time and filter the noise out accordingly. That's what you can see in the YouTube web app, but unfortunately it doesn't let you edit the audio. If any multitrack editor could show the words alongside the audio like the YT web app does, that would be excellent.
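To make that idea a bit more concrete, here is a rough Python sketch of turning word timings into time ranges you might cut or mute. Everything in it is an assumption for illustration: the `FILLERS` set and the `filler_spans` helper are made up here, and the word timings are assumed to already be available as (start, end, text) tuples in seconds.

```python
from typing import Iterable, List, Tuple

Word = Tuple[float, float, str]  # (start_sec, end_sec, text)

# Hypothetical filler list; adjust to taste.
FILLERS = {"um", "uh", "er", "ahem"}


def filler_spans(words: Iterable[Word], fillers=FILLERS) -> List[Tuple[float, float]]:
    """Return merged (start, end) ranges covering the filler words."""
    spans: List[Tuple[float, float]] = []
    for start, end, text in words:
        # Ignore case and trailing punctuation when matching.
        if text.lower().strip(".,!?") not in fillers:
            continue
        if spans and start <= spans[-1][1]:
            # Touches or overlaps the previous span: extend it.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return spans
```

The resulting ranges would still need to be applied by hand (or as labels) in whatever editor you use; this only finds candidates.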
just wanted to add these few links we discussed for reference:
Link: How to Use Truncate Silence and Sound Smarter with Audacity
Link: Howto Truncate Silence in Audacity
Link: Deep Learning 'ahem' detector (github project)
not sure I understand this, but exporting the YouTube captions as WebVTT gave me the individual timings of each word shown in the auto-captioning. The multitrack editing (cutting across all tracks) phase would take place BEFORE any captioning takes place, and you would have already removed any of those undesirable "artifacts".
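For reference, a minimal sketch of pulling those per-word timings out of an exported file. It assumes the auto-caption export marks words with inline timestamps such as `hello<00:00:00.500><c> world</c>`, with the first word of a line starting at the cue start time and timestamps always including the hours field; check your own file before relying on it. `word_timings` is just a name picked for this example.

```python
import re

# Cue header, e.g. "00:00:00.000 --> 00:00:02.500 align:start"
CUE_RE = re.compile(r"(\d+:\d{2}:\d{2}\.\d{3})\s+-->\s+(\d+:\d{2}:\d{2}\.\d{3})")
# Inline word timestamp, e.g. "<00:00:00.500>"
TS_RE = re.compile(r"<(\d+:\d{2}:\d{2}\.\d{3})>")
# <c> ... </c> styling tags wrapped around each word
TAG_RE = re.compile(r"</?c[^>]*>")


def _secs(stamp):
    """Convert 'HH:MM:SS.mmm' to seconds."""
    h, m, s = stamp.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)


def word_timings(vtt_text):
    """Return a list of (start_sec, word) pairs from timestamped cue lines."""
    words = []
    cue_start = None
    for line in vtt_text.splitlines():
        cue = CUE_RE.match(line.strip())
        if cue:
            cue_start = _secs(cue.group(1))
            continue
        # Skip headers, blanks, and repeated lines without inline timestamps.
        if cue_start is None or not TS_RE.search(line):
            continue
        # After the split: texts at even indexes, timestamps at odd indexes.
        parts = TS_RE.split(TAG_RE.sub("", line))
        stamps = [cue_start] + [_secs(p) for p in parts[1::2]]
        for start, text in zip(stamps, parts[0::2]):
            if text.strip():
                words.append((start, text.strip()))
    return words
```

Only start times are recovered this way; the end of a word would have to be approximated by the start of the next one or the end of the cue.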
Interestingly, though, you might be able to do it that way too, in reverse: get the combined transcript and convert it into Audacity labels: http://wiki.audacityteam.org/wiki/Movie_subtitles_(*.SRT)
Even if all the artifacts aren't there, maybe it makes it easier to find some of them to cut out.
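A rough sketch of that conversion, in case it helps anyone try it by hand. It assumes a plain SRT file (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, then text) and writes the tab-separated start/end/text lines that Audacity can import as a label track. `srt_to_labels` is a made-up name for this example, not part of any existing tool.

```python
import re

# Timing line, e.g. "00:00:01,000 --> 00:00:02,500"
SRT_TIME = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s+-->\s+(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def srt_to_labels(srt_text):
    """Convert SRT captions to Audacity label text (start<TAB>end<TAB>text)."""
    labels = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # Caption blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            m = SRT_TIME.search(line)
            if not m:
                continue
            h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
            start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
            end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
            # Join multi-line caption text into a single label.
            text = " ".join(part.strip() for part in lines[i + 1:])
            labels.append(f"{start:.3f}\t{end:.3f}\t{text}")
            break
    return "\n".join(labels)
```

Save the output as a .txt file and import it as labels in Audacity; the labels then sit alongside the audio tracks, which makes the captioned words easier to locate.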
btw, also added a feature to srt2vtt that converts caption files to the Audacity label format: outputs Audacity-compatible text labels from a captions file