Daniel Imfeld

2023-09-17

🔗

After some Python dependency hassles, I ended up using NVIDIA's FastConformer CTC model for the speech-to-text side of Buzzy. It works great and is roughly 8x faster than Whisper when running without a GPU.

On the speech generation side, I ended up using Mycroft's Mimic 3. There are a number of different voices to choose from, and some sound better than others, but it's super fast and the quality is more than acceptable. I found that many of the voices sound better if you use a lengthScale of 1.2 to slow it down a bit.

Journals for 2023-09-17

2023-09-17