Currently it does: all audio is sent to the model.
However, we are working on turn detection within the framework, so you won't have to send silence to the model when the user isn't talking. It's a fairly straightforward path to cutting the cost by roughly 50%.
It works well for voice activity detection, though it doesn't always detect end-of-turn correctly (humans often pause mid-sentence to think). We are working on improving this behavior.
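To illustrate the failure mode: the simplest end-of-turn heuristic is "declare the turn over after N ms of continuous silence", and a mid-sentence thinking pause longer than N trips it. This is a minimal sketch of that heuristic, not code from the framework; the class name, frame size, and hangover value are all illustrative.

```python
FRAME_MS = 20  # assume 20 ms audio frames (illustrative)

class EndOfTurnDetector:
    """Declares end-of-turn after `hangover_ms` of continuous silence."""

    def __init__(self, hangover_ms: int = 600):
        self.hangover_frames = hangover_ms // FRAME_MS
        self.silent_run = 0

    def update(self, is_speech: bool) -> bool:
        """Feed one frame's VAD result; returns True when a turn-end fires."""
        if is_speech:
            self.silent_run = 0
            return False
        self.silent_run += 1
        return self.silent_run == self.hangover_frames

# Speech, then a 700 ms "thinking" pause, then more speech: the pause
# exceeds the 600 ms hangover, so the detector fires mid-sentence.
det = EndOfTurnDetector(hangover_ms=600)
frames = [True] * 10 + [False] * 35 + [True] * 10
fired_at = [i for i, f in enumerate(frames) if det.update(f)]
print(fired_at)  # → [39]: fires inside the pause, before the user finishes
```

A fixed hangover is a tradeoff: short values cut the user off while they think, long values add latency after every real turn, which is why improving on this needs more signal than silence duration alone.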
Can I currently put a VAD module in the pipeline and only send audio when there is an active conversation? Feels like that alone would solve the problem?
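The gating idea itself is simple to sketch. Below is a minimal, self-contained version using a naive energy threshold as a stand-in for a real VAD (a production pipeline would use something like webrtcvad); `vad_gate` and the threshold value are illustrative, not part of any framework API.

```python
import struct

def frame_energy(frame: bytes) -> float:
    """Mean absolute amplitude of a 16-bit little-endian PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return sum(abs(s) for s in samples) / max(len(samples), 1)

def vad_gate(frames, threshold: float = 500.0):
    """Forward only frames whose energy clears the threshold,
    dropping silence before it reaches the model."""
    for frame in frames:
        if frame_energy(frame) >= threshold:
            yield frame

# Two loud (speech-like) frames and two quiet (silence-like) frames:
loud = struct.pack("<4h", 8000, -8000, 8000, -8000)
quiet = struct.pack("<4h", 10, -10, 10, -10)
sent = list(vad_gate([quiet, loud, quiet, loud]))
print(len(sent))  # → 2: half the frames are dropped before upload
```

Gating like this saves on silence, but note it doesn't by itself solve end-of-turn: the model still needs to know when the user has finished speaking, not just when the audio is quiet.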
Speculating here, but I would read this as "anycast" as a concept, where each user is connected to the closest location, rather than anycast as in the IP protocol. The complexity of routing each UDP packet to a different server within the same session would far outweigh the benefits.