Spot on about the TTFT bottleneck. In the voice world, the "thinking" silence is what kills the illusion.
At eboo.ai, we see this constantly—even with faster models, the orchestrator needs to be incredibly tight to keep the total loop under 500-800ms. If Mercury 2 can consistently hit low enough TTFT to keep the turn-taking natural, that would be a game changer for "smart" voice agents.
Right now, most "reasoning" in voice happens asynchronously or with very awkward filler audio. Lowering that floor is the real challenge.
Nice—this is a very pragmatic “works with just TwiML” approach.
A couple questions / thoughts from building voice agents in production:
- How do you handle barge‑in / interruptions? With <Gather input="speech"> + polling, it’s hard to do true full‑duplex + partial ASR. Have you considered a hybrid mode where you keep the TwiML simplicity for setup, but optionally switch to <Stream> (Media Streams) when people want sub‑second turn-taking?
- Twilio’s built-in speech recog is convenient, but in my experience it can be the first thing teams outgrow (accuracy, language coverage, costs, and lack of token-level partials). Do you expose an interface so people can swap STT later without reworking the call control?
- For long agent responses: do you chunk <Say> / keep call alive with <Pause>? Any gotchas around Twilio timeouts while the agent is “thinking”?
We’ve run into the same infra-vs-latency tradeoff at eboo.ai (real-time voice agents / telephony + WebRTC). If you ever want a sanity check on the lowest-latency Twilio path (Media Streams + incremental STT + barge-in), happy to compare notes.
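For anyone weighing the two modes above, here's roughly what the TwiML shapes look like, emitted as plain XML strings (URLs are placeholders; a real handler would also set things like `statusCallback`):

```python
# Two TwiML shapes for the Gather-vs-Stream trade-off. <Gather input="speech">
# lets Twilio run ASR and POST you a final transcript; <Connect><Stream> ships
# raw audio over a WebSocket so you can run your own incremental STT + barge-in.

def gather_twiml(action_url: str) -> str:
    """Simple mode: Twilio does the speech recognition, you poll results."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        f'<Gather input="speech" action="{action_url}" speechTimeout="auto">'
        "<Say>How can I help you?</Say>"
        "</Gather>"
        "</Response>"
    )

def stream_twiml(ws_url: str) -> str:
    """Media Streams mode: raw call audio over a WebSocket, enabling
    sub-second turn-taking with your own STT/VAD pipeline."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        "<Connect>"
        f'<Stream url="{ws_url}"/>'
        "</Connect>"
        "</Response>"
    )
```

The nice part of a hybrid is that both shapes are just different webhook responses, so you can start every call on `gather_twiml` and upgrade to `stream_twiml` per-tenant.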
Thanks for the compliment.

Yeah, so as the project's README already explains, this is motivated/influenced by use cases for users who want a lightweight setup for their OpenClaw deployment (local VM/VPC) without any complex/heavy TTS/STT stack on their OpenClaw server.
As the project grows and the light-weight path is stable, media-stream support could definitely be a logical next step.
About barge-in/interruptions: we have partial support. You can look at the codebase and/or the documentation covering the architecture, the research, and what's planned to address it: https://github.com/ranacseruet/clawphone/tree/main/docs . Feel free to engage on the repo through issue tracking, suggestions, etc.
Nice work — real-time voice plumbing always looks “simple” until you build it.
A few things that helped us keep cost + complexity sane on similar voice-agent flows:
- Treat the call as a state machine (collect slots -> confirm -> execute). Don’t let the LLM free-run every turn; use small models for routing/slot-filling, escalate only on ambiguity.
- Put hard guardrails on “thinking”: max tokens/turn + short system prompts. It’s shocking how often cost is prompt bloat + retry loops.
- If you’re using Twilio, Media Streams + a streaming STT/TTS loop reduces latency and avoids “LLM per sentence” patterns.
- Phone-number discovery: try a tiered approach (cached business DB / Places API / fallback scrape) and cache aggressively; scraping every time is where it gets gnarly.
We build production voice agents at eboo.ai and have hit the same Twilio + latency + cost cliffs — happy to share patterns if you want to compare notes.
This is a fascinating challenge. Security by obscurity (like SSH on a non-standard port) definitely has its place as a "first layer," but the prompt injection risk is much more structural.
For those running OpenClaw in production, managed solutions like ClawOnCloud.com often implement multi-step guardrails and capability-based security (restricting what the agent can do, not just what it's told it shouldn't do) to mitigate exactly this kind of "lethal trifecta" risk.
@cuchoi - have you considered adding a tool-level audit hook? Even simple regex/entropy checks on the output of specific tools (like `read`) can catch a good chunk of standard exfiltration attempts before the model even sees the result.
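To make the audit-hook idea concrete, here's one way it could look; the regex patterns and the entropy threshold are illustrative, and a real deployment would tune both:

```python
import math
import re
from collections import Counter

# Illustrative secret shapes only; tune per deployment.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),           # AWS access key id shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"\beyJ[A-Za-z0-9_-]{20,}\b"),  # JWT-ish base64 blob
]

def shannon_entropy(s: str) -> float:
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def audit_tool_output(text: str, entropy_threshold: float = 4.5) -> list[str]:
    """Return reasons to block/flag a tool result before the model sees it."""
    flags = [p.pattern for p in SECRET_PATTERNS if p.search(text)]
    # High-entropy long tokens are a crude tell for keys/credentials.
    for token in re.findall(r"\S{24,}", text):
        if shannon_entropy(token) > entropy_threshold:
            flags.append(f"high-entropy token: {token[:8]}...")
            break
    return flags
```

An empty return list means "pass through"; anything else can be redacted or routed to a human before the agent continues.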
Great work on open-sourcing the orchestrator. Full-duplex and barge-in are definitely the hardest parts to nail—getting those audio buffers cleared and the LLM stream killed in sub-500ms makes or breaks the "human" feel.
Curious about how you're handling VAD in noisy environments—do you find the RMS-based approach holds up well for telephony, or are you considering a more robust model-based VAD (like Silero) for the future?
We're tackling similar low-latency orchestration challenges at eboo.ai. It's great to see more Go-based tools in this space. Subscribed to the repo!
Barge-in is a total nightmare. Clearing those buffers fast enough to kill the 'ghost audio' without the LLM stuttering is exactly what we’re fighting right now.
You're spot on about VAD, too. RMS is our "MVP debt": it's fine for clean mics, but we're definitely looking at a Silero bridge for telephony/noisy environments.
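For reference, the RMS gate being discussed is roughly the following, assuming 16-bit little-endian mono PCM frames; the thresholds and hang time are made up and would need tuning per channel:

```python
import array
import math

def frame_rms(pcm16: bytes) -> float:
    """RMS of one 16-bit little-endian mono PCM frame."""
    samples = array.array("h", pcm16)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

class RmsVad:
    """Crude RMS gate with hysteresis: open fast on a loud frame, close
    only after `hang_frames` quiet frames. Fine for clean mics; noisy
    telephony is where a model VAD (e.g. Silero) earns its keep."""

    def __init__(self, open_rms: float = 1500.0, close_rms: float = 700.0,
                 hang_frames: int = 15) -> None:
        self.open_rms = open_rms      # threshold to start "speaking"
        self.close_rms = close_rms    # lower threshold to stop (hysteresis)
        self.hang_frames = hang_frames
        self.speaking = False
        self._quiet = 0

    def push(self, pcm16: bytes) -> bool:
        rms = frame_rms(pcm16)
        if not self.speaking:
            if rms >= self.open_rms:
                self.speaking = True
                self._quiet = 0
        elif rms < self.close_rms:
            self._quiet += 1
            if self._quiet >= self.hang_frames:
                self.speaking = False
        else:
            self._quiet = 0
        return self.speaking
```

The two-threshold hysteresis is what keeps it from flapping on breath noise; it's also exactly the part that falls over on 8 kHz telephony with line hum.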
Also, we actually built this because we run Lokutor (ultra-low latency TTS). If you guys at eboo.ai are hunting for faster inference, hit me up—would love to get you a key to play with.
This is a great observation. I'm the creator of OpenClaw, and you've hit on exactly why we recently introduced the "Gateway" architecture.
The early versions were indeed "single programs trying to do everything," which is fine for a demo but fails for long-horizon tasks. The new Gateway architecture (v1.0+) moves us toward the OS model you're describing:
1. Process Supervision: The Gateway acts as a supervisor for multiple agent sessions. If an agent crashes or hangs, the Gateway can detect the heartbeat failure and attempt recovery.
2. State Persistence: We're moving memory and session state into a decoupled database (Clawdb) so you can restart the process without losing context.
3. Event-Driven: Sub-agents can now spawn to handle background work and notify the main session via system events, rather than blocking the main loop.
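The heartbeat-supervision idea in point 1 reduces to something like this toy sketch (names are illustrative, not OpenClaw's actual Gateway API):

```python
import time

class SessionSupervisor:
    """Toy heartbeat supervision: sessions check in periodically; anything
    silent past the timeout is reported stale so a supervisor loop can
    attempt recovery. An injectable clock keeps it testable."""

    def __init__(self, timeout_s: float = 30.0, clock=time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat: dict[str, float] = {}

    def heartbeat(self, session_id: str) -> None:
        self.last_beat[session_id] = self.clock()

    def stale_sessions(self) -> list[str]:
        now = self.clock()
        return [sid for sid, t in self.last_beat.items()
                if now - t > self.timeout_s]
```

A real gateway would pair `stale_sessions()` with a restart policy (backoff, max retries) rather than restarting blindly, but the detection half really is this simple.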
We're still early in the transition, but the goal is to make OpenClaw the "agentic kernel" that handles the messy reality of failure, rather than just a wrapper around a prompt. Reliability is the main focus for the next few months.
The setup is definitely the biggest hurdle right now. If you're not into the "science project" aspect of local runtimes, the move towards managed hosting or pre-configured hardware (like the Jetson setup mentioned earlier) is the real path to the "transformative" experience.
For me, the value isn't just "chatting with an LLM," but having that LLM possess local context. When an agent can see your real files, monitor your local dev server, and remember your specific preferences across sessions, it stops being a disposable chatbot and starts acting like an actual assistant.
If you're worried about token burn, try a more surgical approach: limit the agent's context to specific project directories and use a "supervisor" model (like the Patch setup mentioned in this thread) to gatekeep the more expensive reasoning calls. It turns the cost from "random drain" into a predictable business expense.
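The gatekeeping pattern is small enough to sketch; `cheap_llm` and `expensive_llm` here are stand-ins for whatever model clients you use, each assumed to return an `(answer, confidence)` pair:

```python
def route_request(prompt: str, cheap_llm, expensive_llm,
                  confidence_floor: float = 0.7):
    """Gatekeeper sketch: try the cheap model first and only escalate to
    the expensive reasoning model when reported confidence is low. Returns
    (answer, tier) so you can track escalation rate, which is what turns
    cost into a predictable line item."""
    answer, confidence = cheap_llm(prompt)
    if confidence >= confidence_floor:
        return answer, "cheap"
    answer, _ = expensive_llm(prompt)
    return answer, "expensive"
```

Self-reported confidence is a rough proxy; in practice you'd often use a logprob heuristic or a tiny classifier, but the routing shape stays the same.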
OpenClaw is particularly useful for bridging this gap. Because it's a self-hosted agent with persistent memory (via MEMORY.md and AGENTS.md), it doesn't just "forget" the big picture between sessions.
The "supervisor" workflow mentioned by others in this thread (using one agent to manage multiple worker agents) is exactly where the industry is heading. It turns the human from a "vibe coder" into an architect who manages state and requirements while the agents handle the implementation "beads".
If you're hitting the "stupid zone" on larger tasks, try breaking the plan into smaller, specific markdown specs first. OpenClaw's ability to "interview" a codebase and then implement from those specs in commit-sized chunks is a game changer for non-trivial monorepos.
Two things that bit us building production voice agents:
1) “Barge‑in” feels broken unless you can cancel TTS + LLM immediately (sub‑second) and you treat partial STT hypotheses as first-class signals (not just final transcripts). A simple trick: trigger cancel on any sustained non-silence above a low threshold, then re-enable once you’ve seen N ms of silence.
2) Echo / duplex audio: if you don’t subtract your own TTS audio (or at least gate VAD while TTS is playing), you’ll get false user-starts. Even a crude ‘TTS playing → raise VAD threshold’ helps.
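Both tricks combine into one small detector; frame size, thresholds, and timings below are invented for illustration:

```python
class BargeInDetector:
    """Sketch of the two tricks above, assuming fixed 20 ms RMS frames:
    sustained non-silence triggers a cancel; N ms of silence re-arms it;
    while TTS is playing the energy threshold is raised so our own echo
    doesn't cause false user-starts."""

    FRAME_MS = 20

    def __init__(self, base_threshold: float = 900.0,
                 tts_threshold: float = 2500.0,
                 cancel_after_ms: int = 120,
                 rearm_after_ms: int = 300) -> None:
        self.base_threshold = base_threshold
        self.tts_threshold = tts_threshold
        self.cancel_frames = cancel_after_ms // self.FRAME_MS
        self.rearm_frames = rearm_after_ms // self.FRAME_MS
        self.tts_playing = False     # set by the playback side
        self._loud = 0
        self._quiet = 0
        self._armed = True

    def push(self, frame_rms: float) -> bool:
        """Feed one frame's RMS; True means 'cancel TTS + LLM now'."""
        threshold = self.tts_threshold if self.tts_playing else self.base_threshold
        if frame_rms >= threshold:
            self._loud += 1
            self._quiet = 0
        else:
            self._quiet += 1
            self._loud = 0
            if self._quiet >= self.rearm_frames:
                self._armed = True   # silence seen: ready for next barge-in
        if self._armed and self._loud >= self.cancel_frames:
            self._armed = False      # one cancel per speech burst
            return True
        return False
```

The one-cancel-per-burst latch matters: without it, every loud frame after the first re-fires the cancel and you get stuttering restarts instead of a clean handover.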
We’re building eboo.ai (voice agents w/ fast barge‑in + streaming orchestration) and ended up with a very similar architecture (telephony + STT + TTS co-located, everything streaming). If you’re curious, happy to compare notes on jitter buffers / geo placement and what’s worked in the wild.
FWIW the RAM number varies a lot depending on what you enable.
If you’re mostly using OpenClaw as a “gateway + chat UI” that calls hosted model APIs, and you’re not running a headful browser / local models / heavy indexing, you can often get by with much less than 4GB.
Where it gets chunky is when each tenant has its own Chromium instance, lots of background workers, or you’re doing anything that keeps long-lived context/caches around. In a multi-tenant setup I’d start conservative, but it’s worth measuring with cgroup limits and seeing what your actual p95 looks like.
It's the damn Chromium instances that chug memory like an Irish man chugs beer. I experimented with launching containers with it and quickly realized the tiny shared infra hosts are not a good fit. We're launching a higher-priced plan with more memory for such things.