After some Python dependency hassles, I ended up using NVIDIA's FastConformer CTC model for the speech-to-text side of Buzzy. It works great and is roughly 8x faster than Whisper when running without a GPU.
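For reference, running that model through NVIDIA's NeMo toolkit only takes a few lines. This is a rough sketch, not Buzzy's actual code: it assumes `nemo_toolkit[asr]` is installed, and the checkpoint name and the exact return type of `transcribe()` can vary between NeMo versions.

```python
# Sketch: transcribing a WAV file with a FastConformer CTC model via NeMo.
# Assumption: nemo_toolkit[asr] is installed and the checkpoint name below
# ("nvidia/stt_en_fastconformer_ctc_large") is the one you want.

def transcribe(wav_path: str) -> str:
    """Transcribe a 16 kHz mono WAV file and return the text."""
    import nemo.collections.asr as nemo_asr  # heavy import, kept local

    model = nemo_asr.models.ASRModel.from_pretrained(
        model_name="nvidia/stt_en_fastconformer_ctc_large"
    )
    # transcribe() takes a list of file paths; depending on the NeMo
    # version, each result is either a plain string or a Hypothesis object.
    return model.transcribe([wav_path])[0]
```

Since it's a CTC model, decoding is a single forward pass with greedy collapsing of repeated tokens, which is a big part of why it's so much faster than Whisper's autoregressive decoder on CPU.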
On the speech-generation side, I ended up using Mycroft's Mimic 3. There are a number of voices to choose from, and some sound better than others, but it's super fast and the quality is more than acceptable. I found that many of the voices sound better with a lengthScale of 1.2, which slows the speech down a bit.
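From the command line that looks roughly like this (the voice name here is just an example; `mimic3 --voices` lists what's actually installed):

```shell
# Slow the speech down ~20% with a length scale of 1.2
# (values > 1.0 stretch duration; values < 1.0 speed it up).
mimic3 --voice en_US/vctk_low --length-scale 1.2 \
    "Hello from Buzzy" > hello.wav
```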