Is there a more detailed write-up somewhere? I have llama.cpp on a server that I use via a web interface, but what would be the next steps to be able to talk to it? How do you actually connect speech recognition and wake-word on one side, to the server, to speech generation on the other side?
I'm not aware of any detailed write-ups. Mostly gathered information bit by bit.
At a high level, here is how it works for us:
0. When the voice assistant device (an ESP32) starts, it establishes a WebSocket connection to the server.
1. The ESP32 chip constantly runs wake-word detection (one is provided out of the box by the ESP-IDF framework from Espressif).
2. Whenever a wake-word is detected (we trained a custom one, but you can use the ones provided by Espressif), the chip starts sending audio packets to the backend over the WebSocket.
3. The backend collects audio frames until there is silence (using voice activity detection in Python). As soon as the instruction is over, it tells the device to stop listening and:
4. Passes all collected audio segments to speech recognition (Python with a custom wav2vec model). This gives us the text instruction.
5. Given a text instruction, you could run llama.cpp locally (or vLLM, if you have a GPU) or call a remote API. It all depends on the system. We have a chain of LLM pipelines and RAG that composes our "business logic" across a bunch of AI skills. What's important is that there is a text response at the end.
6. Passes the text response to a text-to-speech model on the same machine and streams the output back to the edge device.
7. The edge device (ESP32) speaks the words, or plays an MP3 file if you send it the URL.
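To make step 3 concrete: here is a rough sketch of silence-based endpointing using a plain energy threshold. A real setup would likely use a proper VAD library (e.g. webrtcvad); the sample rate, frame size, and thresholds below are assumptions you would tune for your device.

```python
import math
import struct

SAMPLE_RATE = 16000           # assumed: 16 kHz mono 16-bit PCM from the ESP32
FRAME_MS = 30                 # assumed 30 ms frames -> 480 samples / 960 bytes
SILENCE_RMS = 500             # assumed energy threshold; tune for your mic
END_OF_UTTERANCE_FRAMES = 25  # ~750 ms of continuous silence ends the utterance

def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def collect_utterance(frames) -> bytes:
    """Accumulate incoming frames until a run of silent frames signals
    end of speech, then return the buffered audio for transcription."""
    collected, silent_run = [], 0
    for frame in frames:
        collected.append(frame)
        if frame_rms(frame) < SILENCE_RMS:
            silent_run += 1
            if silent_run >= END_OF_UTTERANCE_FRAMES:
                break  # instruction is over; caller tells the device to stop
        else:
            silent_run = 0
    return b"".join(collected)
```

In the real backend the `frames` iterable would be the messages arriving on the WebSocket, and the return value is what gets handed to the speech-recognition step.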
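Steps 4–6 are essentially glue between three models. A minimal orchestration sketch, where the three functions are hypothetical stubs standing in for the real wav2vec, LLM, and TTS calls:

```python
# Hypothetical stand-ins for steps 4-6; swap in your real models/APIs.

def speech_to_text(audio: bytes) -> str:
    # In reality: run the wav2vec checkpoint over the buffered audio.
    return "turn on the lights"  # stubbed for illustration

def run_llm(instruction: str) -> str:
    # In reality: call llama.cpp's HTTP server, vLLM, or a remote API,
    # possibly through a chain of pipelines/RAG.
    return f"OK, handling: {instruction}"

def text_to_speech(text: str) -> bytes:
    # In reality: a local TTS model returning PCM/MP3 bytes to stream back.
    return text.encode()  # stubbed for illustration

def handle_utterance(audio: bytes) -> bytes:
    """Audio in, audio out: the whole server-side round trip."""
    instruction = speech_to_text(audio)
    reply = run_llm(instruction)
    return text_to_speech(reply)
```

The point is that each stage only exchanges text or bytes with its neighbors, so any stage can be swapped (local model vs. remote API) without touching the others.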
A custom wake-word on a chip is a bit of a pain, so we run two models: one on the chip, and a second, more powerful one on the server that filters out false positives.
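That two-stage filtering can be as simple as re-scoring the buffered wake-word audio with the bigger server-side model and applying a confidence cutoff. The threshold and scoring function here are placeholders:

```python
WAKE_THRESHOLD = 0.85  # assumed confidence cutoff for the server-side model

def verify_wake_word(audio: bytes, score_fn) -> bool:
    """Second-stage check: the chip's cheap detector already fired;
    a larger server model re-scores the same audio, and only detections
    above the cutoff start the listening session."""
    return score_fn(audio) >= WAKE_THRESHOLD
```

Detections the small on-chip model produces cheaply are confirmed (or rejected) server-side before the device is told to keep streaming.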