Buzzy
- Experimental AI bot to talk with my kids and answer questions.
- The ecosystem around voice-based chats has improved a lot since I started this project. On hold for now, but I'll probably start this up again later in 2024 and just use more services instead of trying to do it all on CPU.
- https://github.com/dimfeld/buzzy
Task List
Up Next
Soon
- Basic intent detection
- Run web searches and generate an answer from the results.
- Read the system message from a file
- Record LLM pipeline actions for later analysis
Later
- Check out whisper-turbo for in-browser voice recognition
- Optional configuration to use better models that require a GPU
Done
- Set up basic ChatGPT workflow
- Voice recognition
- Basic TTS
- Websocket-based communication
- Stream results back to client
System Prompt Example
- You are Buzzy, an AI bot that answers questions for children. Your answers should be appropriate for a smart six-year-old boy, but also don't dumb your answers down too much.
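Roughly what the basic ChatGPT workflow looks like with this prompt, streaming tokens back as they arrive. This is just a sketch: the model name, the prompt file path, and the idea of loading the prompt from a file are placeholders, not necessarily what Buzzy actually does.

```python
# Sketch of the basic ChatGPT workflow with the system prompt above.
# Assumes the openai Python SDK (>= 1.0) and OPENAI_API_KEY in the environment;
# the model name and prompt file path are placeholders.
from openai import OpenAI

client = OpenAI()


def answer(question: str, system_prompt_path: str = "system_prompt.txt") -> str:
    with open(system_prompt_path) as f:
        system_prompt = f.read()

    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        stream=True,
    )

    # Collect streamed chunks; in the real app these would be forwarded over the
    # websocket so TTS can start before the full answer is done.
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)
```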
Ideas
- Do I want to add RAG-based conversation learning/memory?
- Decide whether to send back previous chat messages based on how much time has passed.
- Voice recognition to ask questions
- Choice:
- Decision: nvidia/stt_en_fastconformer_transducer_large (usage sketch after this list)
- Options:
- Run Whisper in browser?
- Whisper on server?
- Too slow when running on CPU
- nvidia/stt_en_fastconformer_transducer_large
- runs fast (~500ms for short passages) and works well
- Needs to run acceptably on just a CPU
- Seems to work best to just send the audio in one big chunk to the server.
- Huggingface Voice Recognition Leaderboard
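A minimal usage sketch for the chosen model with NeMo, running on CPU. The transcribe() signature and return format vary a bit between NeMo releases, so treat this as illustrative rather than the project's actual code.

```python
# Sketch: transcribe one audio chunk on CPU with the chosen NeMo model.
# Assumes nemo_toolkit[asr] is installed; transcribe()'s exact signature and
# return type differ between NeMo versions, so this is illustrative only.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/stt_en_fastconformer_transducer_large"
)
asr_model = asr_model.to("cpu").eval()

# Sending the whole utterance as one chunk worked better than streaming pieces in.
result = asr_model.transcribe(["utterance.wav"])
print(result)  # list of transcripts (or hypothesis objects, depending on version)
```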
- TTS to say responses
- How easy is it to use one of the new models for this?
- Choice:
- Decision: Mimic 3
- Options:
- Bark seems promising
- XTTS
- https://github.com/coqui-ai/TTS/blob/dev/docs/source/models/xtts.md
- Sounds the best, but has a real-time factor (RTF) of 4-5 on CPU
- Mimic 3
- This turns out to be the only option that both sounds good and runs in real time on CPU (sketch below)
- Running with lengthScale 1.2 slows it down a bit and seems to give the best results
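A rough sketch of calling Mimic 3 over HTTP with the slower lengthScale. This assumes a local mimic3-server on its default port; the voice name is a placeholder and the parameter names are from the Mimic 3 web API as I recall it, so verify them.

```python
# Sketch: synthesize a reply with a local mimic3-server, slowed down slightly.
# Assumes mimic3-server is running on its default port (59125); the voice name
# is a placeholder and the query parameter names follow Mimic 3's web API.
import requests


def speak(text: str) -> bytes:
    resp = requests.get(
        "http://localhost:59125/api/tts",
        params={
            "text": text,
            "voice": "en_US/vctk_low",
            "lengthScale": "1.2",  # ~20% slower; seemed to give the best results
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content  # WAV audio


with open("reply.wav", "wb") as f:
    f.write(speak("Hi! I'm Buzzy. What do you want to know?"))
```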
- Some kind of 3D avatar that's like a dinosaur or robot or something?
- Screensaver mode that does a photo carousel
Intent Detection
- DeBERTa-v3-base-mnli-fever-anli seems to work well for this on a first try, though I haven't really exercised it much yet. The creator of that model has since also released deberta-v3-large-zeroshot-v1.
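A minimal zero-shot sketch with the transformers pipeline; the candidate labels here are made up for illustration, not the ones Buzzy uses.

```python
# Sketch: zero-shot intent detection with the DeBERTa NLI model mentioned above.
# Assumes the transformers library; the candidate labels are illustrative only.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli",
)

labels = [
    "a question to answer with a web search",
    "asking how many days until an event",
    "asking to see pictures of something",
    "casual conversation",
]

result = classifier("How many days until Halloween?", candidate_labels=labels)
print(result["labels"][0], result["scores"][0])  # best label and its score
```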
- Tasks
- Figure out if something is a question that can be answered by searching the web and/or Wikipedia
- How many days until...
- Show me pictures of...
- When doing web search and intent detection, we also need to detect whether a query builds on the previous queries or not.
- "No, the blue one" has no context on its own; the needed context is probably in the previous message.
- Small models don't seem to do great on this, but gpt-3.5-turbo-instruct does well with a prompt like the one below.
Assistant: {assistant's last message}
User: {user's question}
Does the user's question ask for clarification on the assistant's statement? Only answer yes or no.
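A sketch of that check against the completions API; the SDK setup and the token/temperature settings are assumptions.

```python
# Sketch: ask gpt-3.5-turbo-instruct whether the new question builds on the
# assistant's last message, using the prompt above. Parameters are a guess.
from openai import OpenAI

client = OpenAI()


def is_clarification(last_assistant_message: str, user_question: str) -> bool:
    prompt = (
        f"Assistant: {last_assistant_message}\n"
        f"User: {user_question}\n"
        "Does the user's question ask for clarification on the assistant's "
        "statement? Only answer yes or no."
    )
    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=2,
        temperature=0,
    )
    return resp.choices[0].text.strip().lower().startswith("yes")
```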
- For this we can probably just pair up the last assistant message with the latest query, since they tend to include all the necessary info again in every message. Then use GPT to create a proper search for it.
- Do we even need to do the detection? Will it work to just ask GPT to make a search for the query?
- Takes some tweaking of the prompt, but this seems to work well (sketch after the prompt).
This is an excerpt of a chat with an assistant and a user:
Assistant: {assistant's last message}
User: {user's question}
What would be a good web search to answer the user's question? If the question is asking for clarification on the assistant's statement, then the web search should account for that. If it is a new line of questioning, then ignore the assistant's statement. Respond only with the web search and nothing else.
Web search:
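And a sketch of turning that prompt into an actual search query, with the same assumptions as the clarification check above.

```python
# Sketch: generate a web search query from the latest exchange using the prompt above.
# Same assumptions as the clarification check: openai SDK >= 1.0, gpt-3.5-turbo-instruct.
from openai import OpenAI

client = OpenAI()


def build_search_query(last_assistant_message: str, user_question: str) -> str:
    prompt = (
        "This is an excerpt of a chat with an assistant and a user:\n"
        f"Assistant: {last_assistant_message}\n"
        f"User: {user_question}\n"
        "What would be a good web search to answer the user's question? "
        "If the question is asking for clarification on the assistant's statement, "
        "then the web search should account for that. If it is a new line of "
        "questioning, then ignore the assistant's statement. "
        "Respond only with the web search and nothing else.\n"
        "Web search:"
    )
    resp = client.completions.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=40,
        temperature=0,
    )
    return resp.choices[0].text.strip().strip('"')
```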
Web Search
- Use the Brave Search API to do web searches to answer questions (sketch at the end of this section)
- Should searches be their own intent, or should we run a search for anything that doesn't match another intent?
- Maybe also Wikipedia/Wikidata?
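A rough sketch of the Brave Search call, pulling out snippets to feed into the answer prompt. The endpoint and header names are from the public API docs as I remember them, so double-check before relying on this.

```python
# Sketch: query the Brave Search web API and pull out snippets to feed into the
# answer prompt. Assumes BRAVE_SEARCH_API_KEY is set; verify the endpoint and
# header names against the current Brave API docs.
import os

import requests


def web_search(query: str, count: int = 5) -> list[dict]:
    resp = requests.get(
        "https://api.search.brave.com/res/v1/web/search",
        params={"q": query, "count": count},
        headers={
            "Accept": "application/json",
            "X-Subscription-Token": os.environ["BRAVE_SEARCH_API_KEY"],
        },
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("web", {}).get("results", [])
    return [
        {"title": r.get("title"), "url": r.get("url"), "snippet": r.get("description")}
        for r in results
    ]
```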