A Universal Benchmark


I think you don't want to be slower than that, but you also don't want to be faster than that. People get scared off because they think, oh, you're preempting me, and that's scary. Like, you answered my question ahead of time.

Source

But if it's slower than maybe 800 milliseconds, people get confused and try to ask the same question again, thinking maybe it's a network issue or maybe the voice wasn't set up quite right. So I think 500 milliseconds is kind of a universal benchmark that we're trying to hit across all the features.

And then, whatever latency you have today, how would you compare it with other, similar apps, like ChatGPT voice-to-voice? Have you tried comparing the two?

Yeah, so we have a technology we call Kernel that we started working on pretty early, more than two years ago.

It basically establishes a streaming model. Because if you think about why there's latency in the first place:

When you press the button, the microphone starts recording into an audio file. That audio file has to be converted and sent to a speech-to-text engine to become text, and then that text is sent to the OpenAI API, or whatever large model, for intent understanding, and it starts generating at its own speed. And that's just a single trip; it's a round trip, so everything has to run in reverse again, with the generated text going back through text-to-speech.
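To make that addition concrete, here is a minimal sketch of the unoptimized, sequential pipeline just described. The stage names and latency numbers are illustrative assumptions, not measurements from the actual app; the point is only that each stage blocks on the previous one, so their latencies sum.

```python
# Hypothetical stage latencies (seconds) for one unoptimized voice turn.
# Illustrative placeholders only, not measured numbers.
STAGE_LATENCY = {
    "record_audio": 1.0,    # wait for the user's utterance to finish
    "speech_to_text": 1.0,  # whole audio file -> transcript
    "llm_generation": 2.5,  # whole transcript -> whole response text
    "text_to_speech": 1.0,  # whole response text -> audio
}

def naive_voice_turn() -> float:
    """Run each stage to completion before the next one starts."""
    total = 0.0
    for stage, latency in STAGE_LATENCY.items():
        total += latency  # every stage blocks on the previous one
        print(f"{stage:>15}: +{latency:.1f}s (cumulative {total:.1f}s)")
    return total

if __name__ == "__main__":
    # These placeholders sum to ~5.5s, in line with the
    # "five to six seconds" figure quoted below.
    print(f"total turn latency: {naive_voice_turn():.1f}s")
```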

So if you add all of this together, if you just go and build a voice AI on top of GPT-4 with no optimization, we know for a fact that for a single dialogue turn you're looking at probably five to six seconds. But we built a streaming model where we basically cut the audio into very, very small time-stamped chunks and made the entire pipeline streaming, as in the sketch below.
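As a sketch of what "making the entire model streaming" can look like, here is a toy generator pipeline in which each stage consumes small chunks from the previous stage instead of waiting for a complete file. Everything here (the chunk size, the stage stubs, the function names) is a hypothetical illustration of the chunked-streaming idea, not the Kernel implementation.

```python
from typing import Iterator

CHUNK_MS = 100  # assumed chunk size; small chunks let stages overlap

def mic_chunks() -> Iterator[bytes]:
    """Yield small audio chunks as they are captured (stubbed here)."""
    for _ in range(20):
        yield b"\x00" * 1600  # placeholder: ~100 ms of 16 kHz 16-bit mono

def stt_stream(audio: Iterator[bytes]) -> Iterator[str]:
    """Emit partial transcripts as audio chunks arrive (stubbed)."""
    for i, _chunk in enumerate(audio):
        yield f"partial-{i}"

def llm_stream(text: Iterator[str]) -> Iterator[str]:
    """Generate tokens from partial input instead of the full transcript."""
    for partial in text:
        yield f"token-for-{partial}"

def tts_stream(tokens: Iterator[str]) -> Iterator[bytes]:
    """Synthesize audio per token so playback can begin almost immediately."""
    for _token in tokens:
        yield b"\x01" * 1600

def streaming_voice_turn() -> None:
    # Each stage pulls from the previous one chunk by chunk, so all four
    # stages run interleaved instead of back to back.
    for i, _audio_out in enumerate(tts_stream(llm_stream(stt_stream(mic_chunks())))):
        if i == 0:
            print("first audio chunk ready after ~one chunk of work per stage")

if __name__ == "__main__":
    streaming_voice_turn()
```

With this shape, the time to the first audio out is roughly one small chunk's worth of work per stage, rather than the sum of four full stages, which is how a five-to-six-second turn can be pushed toward the sub-second range discussed above.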