
TFenrir t1_j9v3joj wrote

A lot of it has to do with computational cost and latency. Converting text to audio and vice versa takes time, and local and cloud-based solutions each come with different challenges.

Let's say you want a chatbot to reply to you in real time with audio, using a cloud-based solution.

First you speak to it, and that audio is sent to a cloud server - this part is relatively fast, and is what already happens with things like Google Home/Alexa. Then the server needs to convert your speech to text and run that text through an LLM. The LLM then creates a response, which needs to be converted back to audio.

Let's say, for a solution like we see with ElevenLabs, it takes 2 seconds of compute for every second of audio you want to generate. If the reply is going to be 10 seconds long, it takes 20 seconds to generate. That would be too slow.
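To put numbers on it, here's a rough back-of-envelope sketch. The 2:1 text-to-speech ratio comes from the example above; the speech-to-text and LLM timings are made-up placeholder values, not measurements of any real service:

```python
# Hypothetical latency estimate for a cloud voice-assistant pipeline.
# Only the tts_ratio (2 s of compute per 1 s of audio) comes from the
# comment above; stt_secs and llm_secs are illustrative assumptions.

def reply_latency(reply_audio_secs, tts_ratio=2.0, stt_secs=0.5, llm_secs=1.5):
    """Total wait (seconds) before the full audio reply is ready."""
    tts_secs = reply_audio_secs * tts_ratio  # dominates for longer replies
    return stt_secs + llm_secs + tts_secs

print(reply_latency(10))  # 0.5 + 1.5 + 20.0 = 22.0
```

Even with generous assumptions for the other stages, the text-to-audio step dominates, which is why a 10-second reply ends up feeling far too slow for conversation.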

You might have an opportunity to stream that audio by converting the text to speech in chunks as the LLM produces it, but these solutions work better when they are given more text to generate all at once... Generating a word at a time would be like talking with A. Period. In. Between. Every. Word.
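A common compromise between those two extremes is chunking at sentence boundaries rather than per word: each sentence is long enough for natural-sounding speech, but you can start playing the first one while the rest is still generating. A minimal sketch (the splitting rule is a simplifying assumption, not how any particular TTS service works):

```python
import re

def sentence_chunks(text):
    """Split LLM output into sentence-sized chunks for streaming TTS.

    Sentence-sized pieces give the speech model enough context for
    natural prosody; word-sized pieces would sound robotic.
    """
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

chunks = sentence_chunks("Sure. The forecast is sunny today. Highs near 20.")
# Each chunk could be handed to the TTS engine as soon as the LLM emits it,
# so playback of the first sentence starts before the reply is finished.
```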
