We just shipped a small architectural update for voice AI assistants that fixes a recurring issue where slow backend calls were blocking live conversations.
In most voice agents today, when the assistant needs to query a database, CRM, or third-party API, the entire conversation pauses until that request finishes. Even a few seconds of silence feels broken on a call.
The approach we shipped separates triggering a backend request from delivering the result.
The assistant can fire a webhook asynchronously and continue the conversation immediately. The backend does its work in the background and, when it’s done (seconds later), pushes the result back into the live call using a call control ID. The assistant receives that context mid-call and incorporates it naturally.
This allows for patterns like continuing a conversation while waiting for slow lookups, running multiple backend requests in parallel, and injecting external context into an active call without restarting it.
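The fire-and-forget pattern above can be sketched in a few lines. This is an illustrative asyncio simulation, not the actual Telnyx API: the in-memory `live_calls` dict and `inject_context` helper stand in for pushing context into a real call via its call control ID.

```python
import asyncio

# Hypothetical stand-in for a live call: in production this would be the
# platform's call-control API, addressed by call_control_id.
live_calls: dict[str, list[str]] = {}

def inject_context(call_control_id: str, text: str) -> None:
    """Simulate pushing backend results into the active call."""
    live_calls[call_control_id].append(text)

async def slow_crm_lookup(customer_id: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for a multi-second backend call
    return f"account status for {customer_id}: active"

async def handle_turn(call_control_id: str, customer_id: str) -> None:
    live_calls[call_control_id] = []
    # Fire the lookup without awaiting it: the conversation continues.
    task = asyncio.create_task(slow_crm_lookup(customer_id))
    inject_context(call_control_id, "Let me pull that up while we talk.")
    # ...the assistant keeps the conversation going here...
    result = await task  # backend finishes seconds later
    inject_context(call_control_id, result)  # delivered mid-call

asyncio.run(handle_turn("call-abc123", "cust-42"))
```

The key point is the separation: triggering the lookup returns immediately, and delivery happens later against the same call identifier.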
We ran into something interesting while working on Voice AI systems.
A lot of teams build and test in the US, where calls usually stay in-country, carrier paths are short, and latency is fairly predictable. You can glue together telephony, speech-to-text, a model, and text-to-speech and it behaves well enough.
That same setup often gets reused in EMEA.
On paper, it looks identical. In production, it isn’t.
Calls cross borders much more often, even when callers don’t. Audio traverses more carrier networks. Latency becomes variable rather than just higher. None of this shows up clearly in small pilots, but it starts to matter once you have real traffic.
The AI model usually isn’t the limiting factor. The call path is.
We wrote this up to better understand why Voice AI behaves differently once geography and telecom routing get involved, especially outside the US.
We’ve seen a lot of voice AI demos over the last year that sound impressive in isolation and then fall apart the moment real callers show up.
This piece is an attempt to write down what actually breaks in production voice systems, and why most of those failures are not model-related. Latency, turn-taking, barge-in, state handling, and escalation end up mattering more than prompt quality or model choice once you put traffic on a phone line.
The core argument is that voice AI should be designed as a real-time system with strict constraints, not as a sequence of API calls glued together. We also go into why many teams underestimate latency by measuring it in the wrong place, and how architecture choices quietly define what is even possible conversationally.
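One way to see the "measuring in the wrong place" problem: benchmarking only model time-to-first-token misses most of what the caller actually waits for. The per-stage numbers below are illustrative assumptions, not measurements.

```python
# Hypothetical per-stage latencies (ms) for one conversational turn.
# Teams often benchmark only "llm", but the caller experiences the sum,
# and the telephony legs vary per call and per region.
stages_ms = {
    "pstn_ingress": 60,      # carrier path into the platform
    "stt_final": 250,        # endpointing + final transcript
    "llm": 400,              # model time-to-first-token
    "tts_first_audio": 150,  # time to first synthesized audio
    "pstn_egress": 60,       # audio back to the caller
}

model_only = stages_ms["llm"]
mouth_to_ear = sum(stages_ms.values())

print(f"model-only: {model_only} ms, caller-perceived: {mouth_to_ear} ms")
```

With these stand-in numbers the model accounts for well under half of the perceived delay, which is why shaving the call path matters more than swapping models.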
This is written from the perspective of building and running these systems, not from a research angle. No claims about “human-like” agents. Mostly lessons learned the hard way.
We work on voice AI infrastructure at Telnyx.
One issue that kept coming up was prompts getting harder to work with over time.
They usually start small. Then edge cases get added. Then tone rules. Then formatting. After a while the prompt still “works,” but nobody wants to edit it because it’s unclear what depends on what.
We added a prompt rewrite feature directly into our voice AI assistant builder.
It takes an existing prompt and rewrites it to be clearer while keeping the same intent and constraints.
You can see the suggested changes inline and decide whether to apply them.
We built this because we kept seeing long prompts cause subtle behavior changes and regressions in production.
We saw many teams prototyping with Retell and then running into the same issues once they tried to scale: variable latency, limited visibility into call failures, and telephony behavior that depends on whichever carrier sits underneath Retell.
Multi-step flows also get hard to maintain because everything lives in one prompt tree.
We built a way to import Retell agents into Telnyx so teams can test the same logic on a different infrastructure. You connect your Retell account, pull your agents, and Telnyx converts them into native assistants. Multi-prompt agents get split into separate components with clear handoff logic.
Most teams that try this mainly want to compare latency, barge-in behavior, and debugging visibility. Since Telnyx runs telephony, STT, LLMs, and TTS in the same PoP, the audio loop stays short, which tends to matter once real callers get involved.
If anyone here has experience moving voice agents between platforms or running into infra bottlenecks in Retell, I would be interested in hearing what problems you saw.
Most providers quote one blended number like “$0.10/min”.
That makes it hard to answer basic questions:
- Did my call cost more because the model used more tokens?
- Because the call sat in silence?
- Because the PSTN leg was expensive?
- Because recording was enabled?
So we shipped Call Cost Breakdowns for Voice AI Assistant calls.
For each call, you can see the cost per component, down to fractions of a cent:
- PSTN or SIP termination
- WebRTC / media
- call control
- recording
- LLM inference (tokens, cached tokens, completions)
- TTS usage
It’s live in the Telnyx Portal and in the API/webhooks, so you can pull it into your own billing and alerts. Release notes: https://telnyx.com/release-notes/voice-ai-call-cost-breakdow...
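To show how a per-component breakdown answers those questions, here is a small sketch that totals a breakdown and finds the dominant cost. The field names and dollar amounts are assumptions for illustration, not the actual API schema; check the release notes and API docs for the real payload shape.

```python
# Illustrative per-minute cost breakdown for one call.
# Keys and values are hypothetical, not the real Telnyx webhook fields.
breakdown = {
    "pstn_termination": 0.0070,
    "media": 0.0020,
    "call_control": 0.0010,
    "recording": 0.0015,
    "llm_inference": 0.0185,  # tokens, cached tokens, completions
    "tts": 0.0120,
}

total = sum(breakdown.values())
largest = max(breakdown, key=breakdown.get)

print(f"total ${total:.4f}/min, largest component: {largest}")
```

With component-level numbers like these, "why did this call cost more?" becomes a lookup instead of a guess: here the model, not the PSTN leg, dominates.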