Pipeline structure
Audio flows left to right. The Audio Buffer sits after transport output so it captures exactly what was played to the customer.
Services
STT (speech-to-text)
| Provider | Model | Notes |
|---|---|---|
| Deepgram | Nova-3 | Default. Real-time streaming. |
| Soniox | stt-rt-v4 | Alternative provider. |
Configured via stt.provider and stt.model in bot config. Extra options (endpointing, smart formatting, language hints) pass through stt.extra.
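As a rough illustration, an STT block in a bot config could be read as shown below. Only stt.provider, stt.model, and stt.extra come from this page. The provider identifier string, the key names inside extra, and the helper stt_settings are hypothetical and will differ per provider.

```python
# Hypothetical bot config fragment. Only the stt.provider, stt.model, and
# stt.extra keys are documented; everything inside "extra" is an assumption.
bot_config = {
    "stt": {
        "provider": "deepgram",      # assumed identifier for Deepgram
        "model": "nova-3",           # assumed identifier for Nova-3
        "extra": {
            "endpointing": 300,      # hypothetical option name and value
            "smart_format": True,    # hypothetical option name
        },
    }
}


def stt_settings(config: dict) -> tuple:
    """Return (provider, model, extra), falling back to the documented default."""
    stt = config.get("stt", {})
    provider = stt.get("provider", "deepgram")
    model = stt.get("model", "nova-3")
    extra = dict(stt.get("extra", {}))  # copy so callers cannot mutate the config
    return provider, model, extra
```

Calling stt_settings({}) falls back to the default provider and model with an empty extra dict.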
LLM (language model)
| Provider | Example models | Notes |
|---|---|---|
| Google (Gemini) | gemini-2.5-flash | Default. Via Google Generative AI API. |
| OpenAI | gpt-4.1-mini | Via OpenAI API. |
| Google Vertex AI | gemini-2.5-flash | Requires project_id in llm.extra. |
The LLM can call built-in functions (end_call, transfer_call, detected_voicemail) and custom tools defined in the bot config.
TTS (text-to-speech)
| Provider | Example models | Notes |
|---|---|---|
| ElevenLabs | eleven_flash_v2_5 | Default. Multilingual models require language code. |
| Sarvam | bulbul:v2 | Indian language support. |
| Tarang | — | Custom HTTP-based TTS. |
VAD and turn detection
Voice Activity Detection (VAD): Silero ONNX model detects when the customer is speaking. Configurable via vad in bot config (confidence threshold, start/stop timing, minimum volume).
Turn detection: SmartTurnV3 ONNX model determines when the customer has finished their turn, triggering the LLM response.
Interruptions: Enabled by default. When the customer speaks over the bot, TTS output is cancelled and the LLM processes the new input. min_words_interruption (default: 3) prevents accidental interruptions from short utterances.
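The word-count gate on interruptions can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name should_interrupt is hypothetical, and it assumes words are split on whitespace and that the threshold is inclusive (an utterance of exactly min_words_interruption words counts).

```python
def should_interrupt(transcript: str, min_words_interruption: int = 3) -> bool:
    """Return True when the customer's utterance is long enough to cancel TTS.

    Assumption: whitespace-separated tokens count as words, and the
    threshold is inclusive.
    """
    return len(transcript.split()) >= min_words_interruption
```

Under these assumptions a short backchannel such as "uh huh" leaves the bot speaking, while "wait I have a question" cancels TTS output.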
Built-in functions
The LLM has access to three built-in functions:
| Function | Behavior |
|---|---|
| end_call() | Bot speaks any final message, then hangs up. Sets disconnected_by = "bot". |
| transfer_call(reason) | Speaks pre-transfer message, then transfers to configured number. |
| detected_voicemail() | Speaks voicemail message (if configured), then hangs up. Sets disconnected_by = "voicemail". |
Re-engagement (dead air handling)
When the customer goes silent mid-call, VoxCore prompts them to respond:
- After gap_seconds of silence, speak a re-engagement message (shuffled, non-repeating)
- Reduce the gap for subsequent attempts
- After max_retries exceeded, end the call with disconnected_by = "RNR"
Configured via re_engagement in bot config: messages (list of prompts), gap_seconds (int or [first, subsequent]), max_retries.
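The re-engagement schedule described above can be sketched as a small planner. All function names here are hypothetical. The sketch assumes "non-repeating" means each message is used once per shuffled pass before any is reused, and that a single int for gap_seconds applies to every attempt.

```python
import random
from typing import Iterator, List, Optional, Sequence, Tuple, Union

GapSetting = Union[int, Sequence[int]]  # int, or [first, subsequent]


def gap_for_attempt(gap_seconds: GapSetting, attempt: int) -> int:
    """Return the silence gap for a 0-based attempt index."""
    if isinstance(gap_seconds, int):
        return gap_seconds
    first, subsequent = gap_seconds
    return first if attempt == 0 else subsequent


def shuffled_messages(messages: Sequence[str], rng: random.Random) -> Iterator[str]:
    """Yield messages in shuffled passes; no message repeats within a pass."""
    while True:
        batch = list(messages)
        rng.shuffle(batch)
        yield from batch


def plan_re_engagement(
    messages: Sequence[str],
    gap_seconds: GapSetting,
    max_retries: int,
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, str]]:
    """Return (gap, message) pairs for each attempt before the call ends as RNR."""
    if not messages:
        raise ValueError("re_engagement.messages must not be empty")
    picker = shuffled_messages(messages, rng or random.Random())
    return [(gap_for_attempt(gap_seconds, i), next(picker)) for i in range(max_retries)]
```

With gap_seconds set to [10, 5] and max_retries set to 3, the planned gaps are 10, 5, and 5 seconds. Once the plan is exhausted, the call would end with disconnected_by = "RNR".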
Recording
Audio is captured by an AudioBufferProcessor placed after transport output. On call end, the buffer is encoded as WAV and uploaded to object storage (DigitalOcean Spaces). The recording URL and storage key are included in call results.
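The WAV encoding step can be illustrated with Python's standard wave module. This is a sketch under stated assumptions: the buffer is taken to hold 16-bit mono PCM at 16 kHz, which this page does not confirm, and the helper name encode_wav is hypothetical. The upload to object storage is omitted.

```python
import io
import wave


def encode_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Wrap raw 16-bit PCM samples in a WAV container held in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit audio (assumed)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

The returned bytes can be handed to whatever storage client performs the upload.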
Max duration
If max_call_duration_seconds is set in bot config (default: 600), the pipeline automatically ends the call when the limit is reached. Sets disconnected_by = "timeout".
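One way to picture the duration limit is a timeout wrapped around the call task. The sketch below uses asyncio.wait_for. Both fake_call and run_with_max_duration are hypothetical stand-ins, and the actual pipeline may enforce the limit differently.

```python
import asyncio


async def fake_call(duration_s: float, disconnected_by: str = "bot") -> str:
    """Hypothetical stand-in for a call that ends on its own after duration_s."""
    await asyncio.sleep(duration_s)
    return disconnected_by


async def run_with_max_duration(call, max_call_duration_seconds: float = 600) -> str:
    """Await the call; report "timeout" if it outlives the configured limit."""
    try:
        return await asyncio.wait_for(call, timeout=max_call_duration_seconds)
    except asyncio.TimeoutError:
        # wait_for cancels the call task before raising, so the call is torn down.
        return "timeout"
```

A call that finishes within the limit keeps its own disconnected_by value. One that runs past the limit is cancelled and reported as "timeout".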
Latency tracking
Per-turn latency is measured across three stages:
| Metric | What it measures |
|---|---|
| stt_ms | Time-to-first-byte from STT processor |
| llm_ms | Time-to-first-byte from LLM processor |
| tts_ms | Time-to-first-byte from TTS processor |
| total_ms | End-to-end response latency |
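Time-to-first-byte per stage can be tracked with a small timer keyed by stage name. The class below is a hypothetical sketch, not the pipeline's own metrics code. It assumes each stage is timed from when it receives input to when it emits its first output, and that total_ms runs from the start of the turn to the first bot audio.

```python
import time
from typing import Callable, Dict


class TurnLatencyTracker:
    """Collects stt_ms, llm_ms, tts_ms, and total_ms for a single turn."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._started: Dict[str, float] = {}
        self.metrics: Dict[str, int] = {}

    def start(self, stage: str) -> None:
        """Mark the moment a stage (or the whole turn) begins."""
        self._started[stage] = self._clock()

    def first_byte(self, stage: str) -> None:
        """Record elapsed milliseconds at the stage's first output only."""
        key = f"{stage}_ms"
        if key in self.metrics or stage not in self._started:
            return  # already recorded, or the stage was never started
        elapsed = self._clock() - self._started[stage]
        self.metrics[key] = round(elapsed * 1000)
```

The injectable clock keeps the tracker testable; in production the default time.monotonic avoids jumps from wall-clock adjustments.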