May 22, 2025 · Engineering · 8 min read

Under the Hood: Achieving Sub-100ms Translation Latency

A deep dive into the architecture decisions — streaming ASR, parallel inference, and edge routing — that make Parley feel instantaneous, even across continents.

Why latency is everything in phone translation

Real-time translation is a latency-sensitive problem unlike almost any other in software. In a text chat, a two-second delay is annoying. In a live phone call, it makes conversation impossible. Humans speak at roughly 120–180 words per minute, and the natural pause between sentences lasts only 400–800 milliseconds.

To fit translation inside that window — with time left over for network transit — you need end-to-end processing under 150ms. We target under 100ms to give ourselves margin.

"The moment latency exceeds 300ms, callers instinctively speak over each other. The conversation breaks down."

The pipeline: three stages, each optimized hard

1. Streaming automatic speech recognition (ASR)

Traditional ASR waits for the end of an utterance, then transcribes it. That approach adds 500ms–1s of delay before we even start translating. We use a streaming ASR model that emits partial transcriptions word by word, as the words are spoken.

The challenge is accuracy. Partial results are inherently less reliable than end-of-utterance results, because context hasn't fully resolved. We handle this with a confidence-weighted buffering system: low-confidence tokens are held, high-confidence tokens are passed downstream immediately.
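A minimal sketch of how a confidence-weighted buffer of this kind could work, for illustration only. The `Token` and `ConfidenceBuffer` names, the 0.85 threshold, and the rule that a held word blocks the words behind it (so output stays in spoken order) are all assumptions, not details of the production system.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    confidence: float  # assumed to be a score in [0.0, 1.0]


class ConfidenceBuffer:
    """Holds low-confidence words; releases words in spoken order."""

    def __init__(self, threshold: float = 0.85) -> None:
        self.threshold = threshold          # illustrative value
        self.tokens: dict[int, Token] = {}  # position -> latest hypothesis
        self.next_pos = 0                   # first position not yet released

    def update(self, pos: int, token: Token) -> list[str]:
        """Record a new or revised hypothesis; return newly releasable words."""
        if pos >= self.next_pos:            # released words are never revised
            self.tokens[pos] = token
        return self._release(final=False)

    def finalize(self) -> list[str]:
        """End of utterance: release whatever is still held."""
        return self._release(final=True)

    def _release(self, final: bool) -> list[str]:
        out = []
        while self.next_pos in self.tokens:
            tok = self.tokens[self.next_pos]
            if not final and tok.confidence < self.threshold:
                break                       # hold this word and everything after it
            out.append(tok.text)
            del self.tokens[self.next_pos]
            self.next_pos += 1
        return out
```

In this sketch a held word is released either when the recognizer revises it with a higher confidence or when the utterance ends. The trade-off sits in the threshold: a higher value means fewer corrections downstream but more words waiting in the buffer.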

2. Simultaneous interpretation vs. sequential translation

Traditional machine translation takes a complete sentence and translates it. Simultaneous interpretation — what human interpreters at the UN do — begins translating before the sentence ends, predicting how it will finish.

We use a transformer-based model fine-tuned for incremental translation. It processes input tokens as they arrive and outputs target-language tokens in real time, using a small future-context window (roughly 0.3 seconds of lookahead) to handle reordering between languages with different word orders, like English and Japanese.
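To make the idea of bounded lookahead concrete, here is a sketch of a wait-k style schedule, a well-known policy from the simultaneous translation literature. It is swapped in for illustration and is not necessarily what the model described above does. It measures lookahead in tokens rather than in seconds, and `translate_next` is a hypothetical stand-in for the real model.

```python
from typing import Callable, Iterable, Iterator, Optional

# Hypothetical model interface: given the source tokens read so far and the
# target tokens already written, return the next target token (or None when
# nothing more can be produced).
NextToken = Callable[[list[str], list[str]], Optional[str]]


def wait_k(source: Iterable[str], translate_next: NextToken, k: int = 2) -> Iterator[tuple[int, str]]:
    """Yield (source_tokens_read, target_token), staying k tokens behind the speaker."""
    seen: list[str] = []
    written: list[str] = []
    for token in source:
        seen.append(token)
        if len(seen) - len(written) >= k:   # enough lookahead to commit one token
            out = translate_next(seen, written)
            if out is not None:
                written.append(out)
                yield len(seen), out
    # Source finished: no more lookahead is coming, so drain the rest.
    while (out := translate_next(seen, written)) is not None:
        written.append(out)
        yield len(seen), out


# Toy word-for-word "model" so the schedule can be run end to end.
LEXICON = {"the": "el", "cat": "gato", "sleeps": "duerme"}


def toy_model(seen: list[str], written: list[str]) -> Optional[str]:
    if len(written) < len(seen):
        return LEXICON[seen[len(written)]]
    return None
```

Running `list(wait_k(["the", "cat", "sleeps"], toy_model, k=2))` shows the first target word appearing only after two source words have been read. A larger `k` gives the model more context for reordering at the cost of added delay, which is the tension between language pairs with similar and different word orders.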

3. Edge-first deployment

Even a perfect model doesn't help if the audio has to travel from a caller in Berlin to a data center in Virginia and back. We co-locate translation compute at 18 edge regions across North America, Europe, and Asia-Pacific. Every call is routed to the nearest inference node.
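The cost of distance can be estimated with a few lines of arithmetic. The sketch below picks the geographically nearest region and computes a lower bound on round-trip time from fiber propagation alone. The region list and coordinates are hypothetical, and geographic distance is only a stand-in for network distance; a real router would more likely rely on measured round-trip times.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_MS = 200.0  # light in optical fiber covers roughly 200 km per millisecond

# Hypothetical edge regions: name -> (latitude, longitude) in degrees.
EDGE_REGIONS = {
    "frankfurt": (50.11, 8.68),
    "virginia": (39.04, -77.49),
    "tokyo": (35.68, 139.69),
}


def great_circle_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Haversine distance between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def nearest_region(caller: tuple[float, float], regions=EDGE_REGIONS) -> str:
    """Name of the region closest to the caller by great-circle distance."""
    return min(regions, key=lambda name: great_circle_km(caller, regions[name]))


def min_round_trip_ms(caller: tuple[float, float], region: tuple[float, float]) -> float:
    """Propagation-only lower bound; real networks add routing and queuing delay."""
    return 2 * great_circle_km(caller, region) / FIBER_KM_PER_MS


BERLIN = (52.52, 13.40)
print(nearest_region(BERLIN))
print(round(min_round_trip_ms(BERLIN, EDGE_REGIONS["virginia"]), 1))
```

Under these assumptions, the Berlin to Virginia round trip costs well over 60ms before any model runs, which would consume most of a 100ms budget on its own.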

For most callers in major cities, round-trip model latency is under 30ms. Combined with ASR and synthesis time, total latency, from the speaker's lips to the listener's ear, lands between 70 and 110ms in typical conditions.

Text-to-speech: the final mile

The translated text still needs to be spoken. Our TTS runs in parallel with translation: we start generating audio for early tokens while later tokens are still being translated. By the time the full sentence is translated, the first two seconds of audio are already rendered and buffered for playback.
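The overlap between translation and synthesis can be sketched as a producer and a consumer sharing a queue. This is an illustration under stated assumptions: `translate` and `synthesize` are hypothetical stubs that sleep instead of running real models, and the delays are arbitrary.

```python
import asyncio


async def translate(target_tokens: list[str], queue: asyncio.Queue, log: list[str]) -> None:
    """Stub translator: replays target tokens one at a time with a delay."""
    for tok in target_tokens:
        await asyncio.sleep(0.01)            # stand-in for model inference time
        log.append(f"translated:{tok}")
        await queue.put(tok)
    await queue.put(None)                    # sentinel: no more tokens


async def synthesize(queue: asyncio.Queue, log: list[str]) -> list[str]:
    """Stub TTS: renders each token as soon as it arrives."""
    audio = []
    while (tok := await queue.get()) is not None:
        await asyncio.sleep(0.005)           # stand-in for audio rendering time
        log.append(f"synthesized:{tok}")
        audio.append(f"<audio:{tok}>")
    return audio


async def run_pipeline(target_tokens: list[str]) -> tuple[list[str], list[str]]:
    queue: asyncio.Queue = asyncio.Queue()
    log: list[str] = []
    # Both stages run concurrently; the queue is the only coupling between them.
    _, audio = await asyncio.gather(
        translate(target_tokens, queue, log),
        synthesize(queue, log),
    )
    return audio, log


audio, log = asyncio.run(run_pipeline(["hola", "mundo", "amigos"]))
print(log)
```

The printed log interleaves the two stages: audio for the first token is rendered before the last token has been translated, so synthesis time largely hides behind translation time instead of adding to it.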

What we still can't solve (yet)

Some language pairs with very different syntactic structures — notably SOV languages like Japanese and Korean paired with SVO languages like English — still require longer lookahead to produce natural-sounding translations. We're currently at 95ms median latency for EN↔ES and EN↔FR, but EN↔JA sits at 140ms median. We're working on it.

Open benchmarks

We publish our latency benchmarks publicly, measured across 50 language pairs three times daily. You can view them at parley.ai/benchmarks. We believe transparency builds trust, and it keeps us honest about where we still have work to do.
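For readers who want to compute comparable figures from their own measurements, here is a small sketch of a per-pair summary. The sample values and the `summarize` helper are hypothetical, and the post does not specify how the published benchmarks are aggregated beyond reporting medians.

```python
from collections import defaultdict
from statistics import median, quantiles


def summarize(samples: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Group (language_pair, latency_ms) samples and report median and p95."""
    by_pair: dict[str, list[float]] = defaultdict(list)
    for pair, latency_ms in samples:
        by_pair[pair].append(latency_ms)
    return {
        pair: {
            "median_ms": median(values),
            # n=20 yields 19 cut points; the last one is the 95th percentile.
            # Each pair needs at least two samples for this call.
            "p95_ms": quantiles(values, n=20, method="inclusive")[-1],
        }
        for pair, values in by_pair.items()
    }


# Made-up measurements, for illustration only.
samples = [
    ("en-es", 90.0), ("en-es", 95.0), ("en-es", 100.0),
    ("en-ja", 130.0), ("en-ja", 140.0), ("en-ja", 150.0),
]
summary = summarize(samples)
print(summary)
```

Reporting a tail percentile next to the median is a common choice for latency data, since a healthy median can hide the slow calls that users notice most.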