Journals for



After some Python dependency hassles, I ended up using NVIDIA's FastConformer CTC model for the speech-to-text side of Buzzy. It works great and runs roughly 8x faster than Whisper on CPU, with no GPU involved.

On the speech generation side, I went with Mycroft's Mimic 3. There are a number of different voices to choose from, and some sound better than others, but it's super fast and the quality is more than acceptable. I found that many of the voices sound better with a lengthScale of 1.2, which slows the speech down a bit.
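
For anyone who wants to try the slower setting, here is a minimal sketch of passing lengthScale to a locally running Mimic 3 web server. It is not necessarily how Buzzy calls Mimic 3. The server address, the voice name, and the helper names `build_tts_request` and `synthesize` are all assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a Mimic 3 web server is running locally on its usual port.
MIMIC3_URL = "http://localhost:59125/api/tts"


def build_tts_request(text, voice="en_US/vctk_low", length_scale=1.2):
    """Build a POST request asking the server to speak `text`.

    The voice is only an example; a lengthScale above 1.0 slows the speech.
    """
    query = urlencode({"voice": voice, "lengthScale": length_scale})
    return Request(f"{MIMIC3_URL}?{query}", data=text.encode("utf-8"), method="POST")


def synthesize(text, out_path="out.wav", **kwargs):
    """Send the request and write the returned audio to `out_path`."""
    with urlopen(build_tts_request(text, **kwargs)) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Calling `synthesize("Hello from Buzzy.")` should write the audio to `out.wav`, provided the server is up and the chosen voice is installed.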

Thanks for reading! If you have any questions or comments, please send me a note on Twitter.

Please also consider subscribing to my weekly-ish newsletter, where I write short essays, announce new articles, and share other interesting things I've found.