Pipeline - Vox Systems

Every call runs through the same core pipeline regardless of route. The route owns the connection details; the pipeline factory assembles Pipecat processors based on runtime bot configuration from VoxBridge.

Pipeline structure

Audio flows left to right. The Audio Buffer sits after transport output so it captures exactly what was played to the customer.
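The sketch below models only the stage order. The route supplies the transport, the bot config picks the services, and the audio buffer sits directly after transport output. `BotConfig`, `assemble_stages`, and the stage names are hypothetical stand-ins for Pipecat processors. The real factory also adds processors that are not shown here.

```python
from dataclasses import dataclass


@dataclass
class BotConfig:
    """Hypothetical subset of the runtime bot configuration."""
    stt_provider: str = "deepgram"
    llm_provider: str = "google"
    tts_provider: str = "elevenlabs"


def assemble_stages(transport: str, cfg: BotConfig) -> list[str]:
    """Return hypothetical stage names in the order audio flows through them."""
    return [
        f"{transport}.input",    # route-owned transport receives caller audio
        f"stt:{cfg.stt_provider}",
        f"llm:{cfg.llm_provider}",
        f"tts:{cfg.tts_provider}",
        f"{transport}.output",   # audio actually played to the customer
        "audio_buffer",          # after output, so it records exactly what was played
    ]
```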

For the communication-quality controls around listening, interruptions, silence, tool timing, grounding, and call telemetry, see Communication quality.

Services

STT (speech-to-text)

Provider	Model	Notes
Deepgram	Nova-3	Default. Real-time streaming.
Deepgram Flux	flux-general-en / flux-general-multi	Uses Deepgram `/v2/listen` and external turn detection.
Soniox	stt-rt-v4	Alternative provider.

Configured via stt.provider and stt.model in bot config. Extra options pass through stt.extra:

Deepgram Nova: endpointing, smart_format, punctuate, interim_results
Deepgram Flux: eot_threshold, eager_eot_threshold, eot_timeout_ms, keyterm, min_confidence, language_hints, should_interrupt
Soniox: language_hints, language_hints_strict, context, enable_speaker_diarization, enable_language_identification, client_reference_id, vad_force_turn_endpoint

language_hints for Deepgram Flux is only applied when stt.model = "flux-general-multi". should_interrupt defaults to true.

For Soniox, VoxCore defaults vad_force_turn_endpoint to false (Pipecat’s library default is true). With it enabled, Silero VAD stop events force Soniox to finalize mid-turn. When soft Hindi/Hinglish speech sits below the VAD min_volume, this wedges the turn and re-engagement never fires. Operators can flip it back per bot via stt.extra.vad_force_turn_endpoint = true. See create_stt_service() in src/voxcore/pipeline/stt_factory.py.
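The following is a minimal sketch of how these per-provider rules could be applied to stt.extra before the STT service is built. `resolve_stt_extra` is a hypothetical helper and does not reproduce the contents of create_stt_service(). It assumes the provider strings "deepgram_flux" and "soniox" match the values used in stt.provider. It also reads "only applied" as dropping language_hints for non-multi Flux models, which is one possible interpretation.

```python
from typing import Any


def resolve_stt_extra(provider: str, model: str, extra: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical: apply documented per-provider defaults/filters to stt.extra."""
    opts = dict(extra)  # copy so the bot config itself is never mutated
    if provider == "deepgram_flux":
        opts.setdefault("should_interrupt", True)
        if model != "flux-general-multi":
            # Assumed reading: hints are ignored unless the multilingual model is used.
            opts.pop("language_hints", None)
    elif provider == "soniox":
        # VoxCore default; Pipecat's own library default is True.
        opts.setdefault("vad_force_turn_endpoint", False)
    return opts
```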

LLM (language model)

Provider	Example models	Notes
Google (Gemini)	gemini-2.5-flash	Default. Via Google Generative AI API.
OpenAI	gpt-4.1-mini	Via OpenAI API.
Google Vertex AI	gemini-2.5-flash	Requires `project_id` in `llm.extra`.

The LLM receives the system prompt and conversation context. It can call built-in functions (end_call, transfer_call, detected_voicemail, search_knowledge) and custom tools defined in the bot config.
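The sketch below covers two checks implied above. Vertex AI needs llm.extra.project_id, and the tools offered to the LLM are the built-ins plus any custom tools. Several details are assumptions: the provider string "vertex", the helper names, and the rule that search_knowledge is offered only when RAG is enabled.

```python
from typing import Any

BUILT_IN_FUNCTIONS = ("end_call", "transfer_call", "detected_voicemail", "search_knowledge")


def validate_llm_config(llm: dict[str, Any]) -> None:
    """Hypothetical check: reject Vertex configs that lack a project_id."""
    # "vertex" is an assumed provider string for Google Vertex AI.
    if llm.get("provider") == "vertex" and not llm.get("extra", {}).get("project_id"):
        raise ValueError("Vertex AI requires llm.extra.project_id")


def tool_names(custom_tools: list[dict[str, Any]], rag_enabled: bool) -> list[str]:
    """Built-in functions first, then custom tools from the bot config."""
    # Assumption: search_knowledge is only offered when RAG is enabled.
    names = [n for n in BUILT_IN_FUNCTIONS if n != "search_knowledge" or rag_enabled]
    names.extend(tool["name"] for tool in custom_tools)
    return names
```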

TTS (text-to-speech)

Provider	Example models	Notes
ElevenLabs	eleven_flash_v2_5	Default. Multilingual models require a language code.
Sarvam	bulbul:v2	Indian language support.
Soniox	—	`provider: "soniox"`. ~20-language map (English, Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu, plus Spanish/French/German/Arabic/Portuguese/Japanese/Korean/Chinese/Russian/Italian). `voice` and `model` are optional.

Configured via tts.provider, tts.model, and tts.voice_id. See create_tts_service() in src/voxcore/pipeline/tts_factory.py for the per-provider settings passthrough. ElevenLabs can be wrapped with Redis-backed TTS caching when tts.cache_config.enabled=true and TTS_CACHE_REDIS_URL is configured.
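Here is a sketch of the cache-gating condition, along with one plausible way to key cached audio. The provider string "elevenlabs" and the key layout are assumptions. The real cache key used by create_tts_service() may include other settings.

```python
import hashlib
import os
from typing import Any, Mapping


def tts_cache_enabled(tts: Mapping[str, Any], env: Mapping[str, str] = os.environ) -> bool:
    """True when ElevenLabs audio would be served through the Redis cache."""
    return (
        tts.get("provider") == "elevenlabs"  # assumed provider string
        and bool(tts.get("cache_config", {}).get("enabled"))
        and bool(env.get("TTS_CACHE_REDIS_URL"))
    )


def tts_cache_key(tts: Mapping[str, Any], text: str) -> str:
    """Hypothetical key: changing the model, voice, or text yields a new entry."""
    # \x1f (unit separator) keeps field boundaries unambiguous.
    raw = "\x1f".join([tts.get("model", ""), tts.get("voice_id", ""), text])
    return "tts:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()
```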

VAD and turn detection

Voice Activity Detection (VAD): A Silero ONNX model detects when the customer is speaking. Configurable via vad in bot config (confidence threshold, start/stop timing, minimum volume).
Turn detection: The SmartTurnV3 ONNX model determines when the customer has finished their turn, triggering the LLM response.
Deepgram Flux: Flux has external turn detection, so for stt.provider = "deepgram_flux" VoxCore uses Flux turn events instead of VAD-based user-turn strategies.
Interruptions: Enabled by default. When the customer speaks over the bot, TTS output is cancelled and the LLM processes the new input. min_words_interruption (default: 3) prevents accidental interruptions from short utterances.
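This is a simplified model of two decisions described above: where end-of-turn signals come from, and whether an utterance spoken over the bot counts as an interruption. Both functions are hypothetical. Counting whitespace-separated words against min_words_interruption reflects the intent of the setting, but VoxCore's exact word counting may differ.

```python
def turn_detection_source(stt_provider: str) -> str:
    """Flux emits its own end-of-turn events; other providers use VAD + SmartTurnV3."""
    return "flux_turn_events" if stt_provider == "deepgram_flux" else "vad_smart_turn"


def should_interrupt(bot_speaking: bool, transcript: str, min_words: int = 3) -> bool:
    """Hypothetical gate: only interrupt the bot for utterances of min_words or more."""
    if not bot_speaking:
        return False  # nothing to interrupt
    return len(transcript.split()) >= min_words
```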

Built-in functions

The LLM has access to built-in functions:

Function	Behavior
`end_call()`	Bot speaks any final message, then hangs up. Sets `disconnected_by = "bot"`.
`transfer_call(reason)`	Speaks pre-transfer message, then transfers to configured target. Native transfer-state is derived from pipeline events and added to the call results — see Transfer and escalation.
`detected_voicemail()`	Speaks voicemail message (if configured), then hangs up. Sets `disconnected_by = "voicemail"`.
`search_knowledge(query)`	Searches attached VoxBridge knowledge bases when RAG is enabled for the bot.

Custom tools defined in bot config are also registered. Each custom tool can use one of four execution policies: immediate, speak-then-run, speak-and-run-parallel, or terminal-after-speech. Tool calls and results are logged in call events and turn metrics.
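The four tool policies can be read as different orderings of "speak" and "run". The asyncio sketch below follows that reading. `execute_tool` is hypothetical. Treating terminal-after-speech as "speak, run, then end the call" is an assumption about that policy's semantics.

```python
import asyncio
from typing import Any, Awaitable, Callable, Tuple

Step = Callable[[], Awaitable[Any]]


async def execute_tool(policy: str, speak: Step, run: Step) -> Tuple[Any, bool]:
    """Run one hypothetical tool call; returns (tool result, end call afterwards?)."""
    if policy == "immediate":
        return await run(), False
    if policy == "speak-then-run":
        await speak()
        return await run(), False
    if policy == "speak-and-run-parallel":
        # Speech and the tool run overlap, so tool latency is hidden behind audio.
        _, result = await asyncio.gather(speak(), run())
        return result, False
    if policy == "terminal-after-speech":
        # Assumed semantics: speak, run the tool, then end the call.
        await speak()
        return await run(), True
    raise ValueError(f"unknown tool policy: {policy}")
```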

Re-engagement (dead air handling)

When the customer goes silent mid-call, VoxCore prompts them to respond:

After gap_seconds of silence, speak a re-engagement message (shuffled, non-repeating)
Reduce the gap for subsequent attempts
After max_retries is exceeded, end the call with disconnected_by = "RNR"

Configured via re_engagement in bot config: messages (list of prompts), gap_seconds (int or [first, subsequent]), max_retries.
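Below is a minimal state-machine sketch of the re-engagement loop. The class is hypothetical. It assumes a bare int gap_seconds applies to every attempt, and it reshuffles the message list once all messages have been used. It also models the user_turn_started reset described under User-turn lifecycle.

```python
import random
from typing import Optional, Sequence, Tuple, Union


class ReEngagement:
    """Hypothetical dead-air state machine; not VoxCore's implementation."""

    def __init__(self, messages: Sequence[str], gap_seconds: Union[int, Sequence[int]],
                 max_retries: int, rng: Optional[random.Random] = None):
        if isinstance(gap_seconds, int):
            first = subsequent = gap_seconds  # assumption: a bare int applies to every attempt
        else:
            first, subsequent = gap_seconds   # [first, subsequent]
        self.first, self.subsequent = first, subsequent
        self.messages = list(messages)
        self.max_retries = max_retries
        self.rng = rng or random.Random()
        self.retries = 0
        self.deck: list[str] = []

    def reset(self) -> None:
        """Called when the customer starts a new turn (user_turn_started)."""
        self.retries = 0

    def current_gap(self) -> int:
        """Seconds of silence to wait before the next prompt."""
        return self.first if self.retries == 0 else self.subsequent

    def on_silence(self) -> Tuple[str, Optional[str]]:
        """Called after current_gap() seconds of silence; returns (action, message)."""
        if self.retries >= self.max_retries:
            return "end_call:RNR", None
        if not self.deck:
            # Shuffled deck: no message repeats until every message has been used.
            self.deck = self.messages[:]
            self.rng.shuffle(self.deck)
        self.retries += 1
        return "speak", self.deck.pop()
```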

Call events

The pipeline appends structured events to the call results as the conversation progresses. These events are the primary tool for debugging why a call felt slow, wedged, or force-closed. All of them are emitted from create_pipeline() in src/voxcore/pipeline/factory.py.

Service errors

When the pipeline raises an error frame, VoxCore records a service_error event:

Field	Meaning
`service`	Classified as `llm`, `tts`, `stt`, or `unknown`. Derived from the failing processor’s class name, falling back to matching provider names in the error message.
`processor`	Processor class name that raised the error.
`message`	Error string (truncated to 500 chars).
`fatal`	`true` when the error frame is fatal (terminates the call).
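Here is a sketch of the classification rule. It assumes Pipecat-style class names that end in LLMService, TTSService, or STTService. Whole suffixes are matched because a bare "TTS" substring also appears inside "DeepgramSTTService". The provider hint lists are illustrative only, and Soniox is left out because it serves both STT and TTS. The event dict shape is hypothetical.

```python
PROVIDER_HINTS = {  # illustrative lookup, not VoxCore's real table
    "llm": ("gemini", "google", "openai", "vertex"),
    "tts": ("elevenlabs", "sarvam"),
    "stt": ("deepgram", "flux"),
}
CLASS_SUFFIXES = {"llm": "LLMService", "tts": "TTSService", "stt": "STTService"}


def classify_service(processor: str, message: str) -> str:
    """Class name first, then provider names in the error message, else 'unknown'."""
    for service, suffix in CLASS_SUFFIXES.items():
        if suffix in processor:  # case-sensitive match on the whole suffix
            return service
    text = message.lower()
    for service, hints in PROVIDER_HINTS.items():
        if any(hint in text for hint in hints):
            return service
    return "unknown"


def service_error_event(processor: str, message: str, fatal: bool) -> dict:
    """Build a hypothetical service_error event dict."""
    return {
        "type": "service_error",
        "service": classify_service(processor, message),
        "processor": processor,
        "message": message[:500],  # truncated to 500 chars
        "fatal": fatal,
    }
```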

User-turn lifecycle

These events trace how each customer turn was detected and closed — essential for diagnosing dead-air and force-closed-turn bugs.

Event	Meaning
`user_turn_started`	A user turn began. Carries `strategy` (turn strategy class name, or `null`). Resets the re-engagement retry counter.
`user_turn_inference_triggered`	A turn strategy fired and the LLM was triggered. Carries `strategy`.
`turn_stop_timeout`	No stop strategy fired before the timeout; the turn was force-closed by the watchdog without inference.
`user_turn_stopped`	The turn ended. Carries `strategy`, `inference_triggered` (whether the LLM was triggered), and `had_content` (whether the turn had transcript text).

A user_turn_stopped with strategy = null, inference_triggered = false, and had_content = true means the turn was force-closed by the stop-timeout watchdog and its transcript was discarded without reaching the LLM. This is the signature of the dead-air bug.
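The helper below scans call events for that signature. It assumes each event is a dict that carries its name under a hypothetical "type" key, alongside the fields listed above.

```python
def find_dead_air_turns(events: list[dict]) -> list[int]:
    """Indexes of user_turn_stopped events matching the dead-air signature."""
    return [
        i
        for i, e in enumerate(events)
        if e.get("type") == "user_turn_stopped"
        and e.get("strategy") is None          # no stop strategy fired
        and e.get("inference_triggered") is False
        and e.get("had_content") is True       # transcript existed but was discarded
    ]
```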

Recording

Audio is captured by an AudioBufferProcessor placed after transport output. On call end, the buffer is encoded as WAV and uploaded to S3-compatible object storage: DigitalOcean Spaces or MinIO depending on runtime config. The recording URL and storage key are included in call results.
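The WAV encoding step can be sketched with Python's standard wave module. The 16 kHz mono, 16-bit PCM format used here is an assumption; the buffer's actual sample rate comes from the transport. The upload itself would typically be a put_object call through an S3-compatible client, and it is omitted here.

```python
import io
import wave


def encode_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM in a WAV container (format is assumed)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    # wave does not close a caller-supplied file object, so buf is still readable.
    return buf.getvalue()
```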

Max duration

max_call_duration_seconds in bot config (default: 600) caps the call length. When the limit is reached, the pipeline ends the call automatically and sets disconnected_by = "timeout".
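The following minimal asyncio sketch treats the limit as a background timer that is cancelled if the call ends first. `enforce_max_duration` and the `end_call` callback are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable

DEFAULT_MAX_CALL_DURATION_SECONDS = 600


async def enforce_max_duration(
    end_call: Callable[[str], Awaitable[None]],
    max_seconds: float = DEFAULT_MAX_CALL_DURATION_SECONDS,
) -> None:
    """Wait out the call limit, then end the call with disconnected_by="timeout".

    Meant to run as a background task; cancelling it (because the call ended
    first) raises CancelledError inside the sleep, so end_call never runs.
    """
    await asyncio.sleep(max_seconds)
    await end_call("timeout")
```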

Latency tracking

Per-turn latency is measured and collected as samples (latency_samples):

Per-turn sample	What it measures
`stt_ms`	Time-to-first-byte from STT processor
`llm_ms`	Time-to-first-byte from LLM processor
`tts_ms`	Time-to-first-byte from TTS processor
`tool_ms`	Custom tool execution latency, when tools ran
`rag_ms`	Knowledge search latency, when RAG ran
`total_ms`	End-to-end response latency

At call end these samples are averaged into the latency object on the call results (LatencyData in src/voxcore/models/results.py, populated by compute_avg_latency() in src/voxcore/routes/_post_call.py). The averaged fields are named distinctly from the per-turn samples:

Averaged field	Source sample
`stt_avg_ms`	mean of `stt_ms`
`llm_avg_ms`	mean of `llm_ms`
`tts_avg_ms`	mean of `tts_ms`
`tool_avg_ms`	mean of `tool_ms`
`rag_avg_ms`	mean of `rag_ms`
`total_avg_response_ms`	mean of `total_ms`

Both the raw latency_samples list and the averaged latency object are included in the final call results.
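This sketch of the averaging step follows the field mapping above. `average_latency` is a hypothetical stand-in for compute_avg_latency(). Two behaviors are assumptions: turns where a field was not recorded are skipped, and the average is None when no turn recorded that field. Both matter for tool_ms and rag_ms, which appear only on some turns.

```python
from statistics import fmean
from typing import Optional

SAMPLE_TO_AVG = {
    "stt_ms": "stt_avg_ms",
    "llm_ms": "llm_avg_ms",
    "tts_ms": "tts_avg_ms",
    "tool_ms": "tool_avg_ms",
    "rag_ms": "rag_avg_ms",
    "total_ms": "total_avg_response_ms",
}


def average_latency(samples: list[dict]) -> dict[str, Optional[float]]:
    """Average each per-turn field, skipping turns where it was not recorded."""
    out: dict[str, Optional[float]] = {}
    for sample_key, avg_key in SAMPLE_TO_AVG.items():
        values = [s[sample_key] for s in samples if s.get(sample_key) is not None]
        out[avg_key] = fmean(values) if values else None
    return out
```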

Live prompt caching

For non-policy bots, the static portion of the system prompt can be served from a provider-side prompt cache to cut LLM cost and time-to-first-token on the live call. Gemini/Vertex use explicit CachedContent; OpenAI uses prompt_cache_key/prompt_cache_retention request hints. VoxCore reports per-call cache hit/miss metrics on the first live llm usage entry (cache.namespace = "live_prompt"). See Caching for the full design, including how static and dynamic prompt portions must be split and the lifecycle of the Gemini cache registry.
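The illustration below shows why the static and dynamic portions must be split. If the cache key is derived only from the static text, every call for the same bot maps to the same cached prefix, and per-call fields can vary freely. `SplitPrompt`, `build_prompt`, and `cache_key` are hypothetical. This is not the Gemini cache registry, nor the exact value VoxCore passes as prompt_cache_key.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitPrompt:
    static: str   # identical on every call for this bot: the cacheable prefix
    dynamic: str  # per-call values, appended after the cached prefix


def build_prompt(static_text: str, dynamic_fields: dict[str, str]) -> SplitPrompt:
    """Keep per-call values out of the static prefix so it stays cacheable."""
    dynamic = "\n".join(f"{k}: {v}" for k, v in sorted(dynamic_fields.items()))
    return SplitPrompt(static=static_text, dynamic=dynamic)


def cache_key(bot_id: str, prompt: SplitPrompt) -> str:
    """Hypothetical key derived only from the static portion."""
    digest = hashlib.sha256(prompt.static.encode("utf-8")).hexdigest()[:16]
    return f"{bot_id}:{digest}"
```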
