Prompt caching - Vox Systems

VoxCore caches the static portion of LLM prompts so that the model does not re-process the same system instructions (and tool specs) on every turn. Two independent caches share the same registry/Redis pattern:

Live prompt cache

Caches the static system prompt + tools used during the live call. Cuts per-turn latency and input-token cost.

Post-call context cache

Caches the analysis/QC system prompt for post-call LLM passes. Cuts post-call input-token cost.

Both caches use explicit Gemini/Vertex CachedContent objects and apply only to those providers. OpenAI uses request-level hints instead of the Gemini cache registry.

Why

A bot’s system prompt — policy, compliance rules, objection handling, tool specs — is large and identical across every turn and every call for that bot version. Re-sending it each turn wastes input tokens and adds latency. Caching the static block lets the provider charge the cheaper cached-input rate and skip re-tokenizing it.

Never cache a fully rendered system prompt that contains per-customer data. Reuse is poor and dynamic context can leak across calls. The prompt must be split into a static block and a fresh runtime block first.

Prompt split

The split is configured per bot via PromptPartsConfig (prompt_parts in bot config, src/voxcore/models/config.py):

Field	Purpose
`mode`	`legacy` (no split, no cache) or `direct` (split prompt, cache eligible)
`cache_enabled`	Master toggle for live prompt caching on this bot
`static_system_prompt`	The cacheable block: policy, compliance, flow, tool-usage rules, disposition schema
`dynamic_runtime_prompt`	Fresh per-call block: customer variables, CRM fields, dates, attempt data
`static_version`	Cache version — bump to invalidate when the static prompt changes
`openai_prompt_cache_key`	OpenAI cache routing key (OpenAI only)
`openai_prompt_cache_retention`	`in_memory` or `24h` (OpenAI only)

Live caching only engages when mode == "direct" and cache_enabled is true (src/voxcore/pipeline/factory.py).
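The eligibility rule can be sketched as follows. This is a minimal illustration, not the actual PromptPartsConfig from src/voxcore/models/config.py: the class name PromptPartsSketch, the field types and defaults, and the helper live_cache_eligible are assumptions. Only the field names and the mode/cache_enabled rule come from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptPartsSketch:
    """Hypothetical stand-in that mirrors the documented prompt_parts fields."""

    mode: str = "legacy"  # "legacy" or "direct"
    cache_enabled: bool = False
    static_system_prompt: str = ""
    dynamic_runtime_prompt: str = ""
    static_version: str = "1"  # assumed to be a string; bump to invalidate
    openai_prompt_cache_key: Optional[str] = None
    openai_prompt_cache_retention: Optional[str] = None  # "in_memory" or "24h"


def live_cache_eligible(parts: PromptPartsSketch) -> bool:
    """Live caching engages only for direct-mode bots with the toggle on."""
    return parts.mode == "direct" and parts.cache_enabled


legacy_bot = PromptPartsSketch(mode="legacy", cache_enabled=True)
direct_bot = PromptPartsSketch(mode="direct", cache_enabled=True)
```

Here legacy_bot stays ineligible even though cache_enabled is true, because a legacy prompt has no static block to cache.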

Direct-prompt bots carry both system_prompt and static_system_prompt. When the live cache is active, VoxCore builds the live LLM context from the static block plus dynamic_runtime_prompt — not the legacy system_prompt. Keep both in sync.
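A rough sketch of why the split matters: the static block stays byte-identical across calls, so its hash (and therefore any cache keyed on it) is stable, while only the runtime block changes. The function build_live_context, the $name placeholder syntax, and the sample prompts are assumptions made for illustration; the source does not say how VoxCore renders the runtime block.

```python
import hashlib
from string import Template


def build_live_context(static_system_prompt: str,
                       dynamic_runtime_prompt: str,
                       call_variables: dict) -> dict:
    """Leave the static block untouched and render only the runtime block."""
    # Assumption: the runtime block uses $name placeholders.
    runtime = Template(dynamic_runtime_prompt).safe_substitute(call_variables)
    static_hash = hashlib.sha256(static_system_prompt.encode("utf-8")).hexdigest()
    return {"static": static_system_prompt, "runtime": runtime, "static_hash": static_hash}


STATIC = "You are a collections agent. Follow the compliance rules."
DYNAMIC = "Customer: $customer_name. Attempt: $attempt."

call_a = build_live_context(STATIC, DYNAMIC, {"customer_name": "Asha", "attempt": 1})
call_b = build_live_context(STATIC, DYNAMIC, {"customer_name": "Ravi", "attempt": 3})
```

Both calls share the same static_hash, so they can reuse one cached block, and no customer data ever enters the cacheable text.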

Live prompt cache

Gemini / Vertex

For google and google_vertex providers, VoxCore resolves a CachedContent name and injects it before the LLM service is created (src/voxcore/pipeline/factory.py):

generation_config["cached_content"] = live_prompt_cache_result.cached_content_name
generation_config["tools"] = None
generation_config["tool_config"] = None

When cached_content is active, the context is built with tools=NOT_GIVEN and system_instruction is omitted per-request — Gemini requires that tools and the system instruction live inside the cached content, not in the request.

Pipecat builtin-tool caveat. With tools=NOT_GIVEN, Pipecat’s BaseLLMAdapter.from_standard_tools skips its builtin-tool injection entirely (it only merges builtin tools when the input is a ToolsSchema). VoxCore is safe today because every tool (end_call, transfer_call, detected_voicemail, search_knowledge, custom tools) is registered via register_function(...), not as a Pipecat builtin tool. If a future Pipecat upgrade introduces a builtin tool we need, add its spec inside CreateCachedContentConfig.tools in live_prompt_cache.py:_convert_tools_for_gemini — do not re-enable per-request tools; Gemini rejects that combination. Verified against Pipecat 1.2.1.

OpenAI

OpenAI does not use the Gemini registry. When the bot is direct-mode + cache-enabled and openai_prompt_cache_key is set, VoxCore passes request hints (src/voxcore/pipeline/llm_factory.py):

prompt_cache_key — routing key for the provider’s automatic prefix cache
prompt_cache_retention — in_memory or 24h

No CachedContent is created; VoxCore relies on the provider’s automatic prefix caching and its reported cached-token usage.
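The gating described above could look roughly like this. The helper openai_cache_hints and its argument names are hypothetical, and the sketch only builds a plain dict of extra request parameters rather than calling any SDK; the two parameter names and the allowed retention values are taken from this page.

```python
from typing import Any, Dict, Optional

ALLOWED_RETENTION = {"in_memory", "24h"}


def openai_cache_hints(mode: str,
                       cache_enabled: bool,
                       cache_key: Optional[str],
                       retention: Optional[str] = None) -> Dict[str, Any]:
    """Return request hints, or an empty dict when the bot is not eligible."""
    if mode != "direct" or not cache_enabled or not cache_key:
        return {}
    hints: Dict[str, Any] = {"prompt_cache_key": cache_key}
    if retention is not None:
        if retention not in ALLOWED_RETENTION:
            raise ValueError(f"unsupported retention: {retention!r}")
        hints["prompt_cache_retention"] = retention
    return hints
```

Returning an empty dict for ineligible bots keeps the request identical to the uncached path.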

Cache resolution order

get_or_create_live_prompt_cache (src/voxcore/pipeline/live_prompt_cache.py) resolves the cache name in this order:

Prewarmed name

If VoxBridge supplied a prewarmed live_prompt_cache_state.cache_name, it short-circuits everything else (no registry lookup, no Redis lookup, no Vertex create). The result is reported with status hit and source prewarmed.

In-process registry

A bounded OrderedDict (max 512 entries) keyed by provider + client identity + model + prompt hash + version + tools hash. A fresh entry is returned directly.

Redis

Shared across all 16 workers via voxcore:live_prompt_cache: keys. A hit is promoted into the local registry so siblings skip creation.

Create

Otherwise VoxCore creates a new Gemini CachedContent (static prompt as system_instruction, converted tools, TTL), then writes it to the registry and Redis.

Entries are proactively refreshed once they pass ~90% of their TTL (_LIVE_PROMPT_CACHE_REFRESH_RATIO = 0.9); least-recently-used entries are evicted when the registry is full.
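The resolution order above can be sketched as a small class. This is an illustration under stated assumptions, not the code in live_prompt_cache.py: the class and method names are hypothetical, a plain dict stands in for Redis, the create callback stands in for the Gemini CachedContent call, and every source label other than prewarmed is invented for the example.

```python
import time
from collections import OrderedDict
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

MAX_ENTRIES = 512      # documented registry bound
REFRESH_RATIO = 0.9    # documented near-TTL refresh ratio


@dataclass
class CacheEntry:
    name: str
    created_at: float
    ttl_seconds: float

    def is_fresh(self, now: float) -> bool:
        # An entry stops being served once it passes 90% of its TTL.
        return (now - self.created_at) < self.ttl_seconds * REFRESH_RATIO


class LivePromptCacheSketch:
    def __init__(self,
                 create: Callable[[str, float], CacheEntry],
                 shared: Optional[Dict[str, CacheEntry]] = None,
                 max_entries: int = MAX_ENTRIES) -> None:
        self._registry = OrderedDict()  # in-process, LRU ordered
        self._shared = shared if shared is not None else {}  # Redis stand-in
        self._create = create
        self._max_entries = max_entries

    def _remember(self, key: str, entry: CacheEntry) -> None:
        self._registry[key] = entry
        self._registry.move_to_end(key)
        while len(self._registry) > self._max_entries:
            self._registry.popitem(last=False)  # evict least recently used

    def resolve(self, key: str,
                prewarmed_name: Optional[str] = None,
                now: Optional[float] = None) -> Tuple[str, str]:
        now = time.time() if now is None else now
        if prewarmed_name:                        # 1. prewarmed name wins
            return prewarmed_name, "prewarmed"
        entry = self._registry.get(key)           # 2. in-process registry
        if entry is not None and entry.is_fresh(now):
            self._registry.move_to_end(key)
            return entry.name, "registry"
        entry = self._shared.get(key)             # 3. shared store
        if entry is not None and entry.is_fresh(now):
            self._remember(key, entry)            # promote to local registry
            return entry.name, "redis"
        entry = self._create(key, now)            # 4. create and publish
        self._remember(key, entry)
        self._shared[key] = entry
        return entry.name, "created"
```

A stale entry falls through every lookup and reaches the create step, which is how the near-TTL refresh happens without a separate background job in this sketch.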

Lifecycle ownership

VoxBridge owns the scheduled lifecycle; VoxCore is a consumer that prefers VoxBridge’s prewarmed name and self-heals when it expires.

Responsibility	Owner	Reference
7am prewarm (default `prewarm_hour=7`)	VoxBridge	`services/live_prompt_cache_lifecycle.py`
11pm cleanup (default `cleanup_hour=23`, disabled bots only)	VoxBridge	`services/live_prompt_cache_lifecycle.py`
10h TTL (default `ttl_hours=10`)	VoxBridge	`services/live_prompt_cache_lifecycle.py`
Missed-run catch-up	VoxBridge	`_catch_up_missed_runs`
Inline single-flight recreate	VoxBridge	`routes/internal_live_cache.py` (`/recreate`)
Audit events	VoxBridge	`services/live_prompt_cache_audit.py`
Prefer prewarmed name, in-process registry, Redis sharing, near-TTL refresh	VoxCore	`pipeline/live_prompt_cache.py`

Cache refresh is version-based, not delete-based. To force a refresh, bump static_version (which changes the registry key) and invalidate the VoxBridge bot-config cache. Do not try to delete Gemini caches across all workers — old caches age out by TTL. Cleanup deliberately targets disabled bots only, because deleting an enabled bot’s cache nightly created a dead window between cleanup and the next prewarm.
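A small sketch of version-based invalidation: because the version is part of the key, bumping it makes the next lookup miss without touching the old entry. The function versioned_key and the key layout after the voxcore:live_prompt_cache: prefix are assumptions; the real key also includes provider, client identity, model, and tools hash.

```python
import hashlib


def versioned_key(static_prompt: str, static_version: str) -> str:
    """Hypothetical, simplified key: prefix + prompt hash + version."""
    prompt_hash = hashlib.sha256(static_prompt.encode("utf-8")).hexdigest()[:16]
    return f"voxcore:live_prompt_cache:{prompt_hash}:{static_version}"


# The registry holds an entry created under version "3".
registry = {versioned_key("policy text", "3"): "cachedContents/old"}

# After bumping static_version to "4", the lookup uses a different key.
new_key = versioned_key("policy text", "4")
lookup = registry.get(new_key)
```

The lookup returns None, so the next call creates a fresh cache, while the version "3" entry stays in place until its TTL runs out.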

Cache-expiry recovery

A cached content name can become unusable mid-call — TTL expiry (400 INVALID_ARGUMENT ... is expired) or deletion/aging-out (404 ... cached content metadata ... not found). Both are recoverable. GoogleLLMServiceWithCacheRecovery / GoogleVertexLLMServiceWithCacheRecovery (src/voxcore/pipeline/google_llm_with_cache_recovery.py) handle it:

Detect during iteration

The Gemini stream is lazy — the error surfaces while iterating the response, not when the stream is awaited. VoxCore wraps the iteration itself so the error is caught.

Evict the stale name

Removes it from the local registry and Redis via invalidate_live_prompt_cache.

Fetch a replacement

Reads the current prewarmed name from VoxBridge (live_prompt_cache_state.cache_name); if missing or unchanged, POSTs to VoxBridge /recreate for an inline single-flight create.

Swap for the next turn

Mutates self._settings.extra["generation_config"]["cached_content"]. Pipecat re-reads self._settings.extra per _stream_content call, so the next turn on the same call uses the new cache.

Customer experience: the current turn fails (a moment of silence on one turn), but the call is not dropped — the next turn continues on the fresh cache. The alternative (failing the call) would break every call referencing the expired cache. Audit events (expired_in_call, swap_after_expiry) are posted to VoxBridge fire-and-forget.
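The four recovery steps can be sketched with a fake lazy stream. Everything here is a stand-in: RuntimeError replaces the real provider exception, fake_stream replaces the Gemini stream, and the class, callbacks, and cache names are hypothetical. Only the error fragments being matched and the order of the steps come from the description above.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List, Optional


def is_cache_expiry_error(message: str) -> bool:
    """Match the two recoverable failures: expired (400) and not found (404)."""
    text = message.lower()
    expired = "invalid_argument" in text and "is expired" in text
    missing = "cached content metadata" in text and "not found" in text
    return expired or missing


async def fetch_replacement(stale: str,
                            read_prewarmed: Callable[[], Awaitable[Optional[str]]],
                            recreate: Callable[[], Awaitable[str]]) -> str:
    """Prefer the current prewarmed name; recreate if missing or unchanged."""
    current = await read_prewarmed()
    if current and current != stale:
        return current
    return await recreate()


async def fake_stream(name: str) -> AsyncIterator[str]:
    # Lazy: nothing runs until iteration starts, so the error surfaces there.
    if name == "cachedContents/old":
        raise RuntimeError("400 INVALID_ARGUMENT: cache cachedContents/old is expired")
    yield "hello"


class RecoveringLLMSketch:
    def __init__(self, cached_content: str,
                 invalidate: Callable[[str], Awaitable[None]],
                 replace: Callable[[str], Awaitable[str]]) -> None:
        self.generation_config = {"cached_content": cached_content}
        self._invalidate = invalidate
        self._replace = replace

    async def run_turn(self) -> List[str]:
        name = self.generation_config["cached_content"]
        chunks: List[str] = []
        try:
            async for chunk in fake_stream(name):  # 1. detect during iteration
                chunks.append(chunk)
        except RuntimeError as exc:
            if not is_cache_expiry_error(str(exc)):
                raise
            await self._invalidate(name)                      # 2. evict stale name
            new_name = await self._replace(name)              # 3. fetch replacement
            self.generation_config["cached_content"] = new_name  # 4. swap
            return []  # this turn produces nothing; the call continues
        return chunks
```

The first turn on the expired cache returns an empty result, and the following turn reads the swapped name and succeeds, which mirrors the one-turn silence described above.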

Post-call context cache

Post-call analysis, QC, and callback extraction share the same registry/Redis pattern in src/voxcore/processors/post_call.py (voxcore:post_call_cache: keys, max 512 entries, 90% TTL refresh).

Configuration

Enable per bot via either the legacy flat fields or the nested post_call_cache dict (post_call.py reads both):

Field	Purpose
`post_call_cache_enabled`	Legacy flat toggle
`post_call_cache_version`	Legacy flat version
`post_call_cache.enabled`	Nested toggle (overrides flat)
`post_call_cache.analysis_version`	Per-namespace version for analysis
`post_call_cache.qc_version`	Per-namespace version for QC

The cached system_instruction is the prompt without the injected per-call date context — the date block is sent as fresh content so the cache key stays stable across calls (cache_system_prompt vs system_prompt in post_call.py).
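A minimal sketch of that separation, assuming a simple "Current date" line as the injected block. The function name and the date format are hypothetical; the cache_system_prompt and system_prompt names follow the ones mentioned above.

```python
from datetime import date
from typing import Tuple


def split_post_call_prompt(template: str, today: date) -> Tuple[str, str, str]:
    """Return (cache_system_prompt, date_context, system_prompt)."""
    cache_system_prompt = template  # stable across calls, safe to cache
    date_context = f"Current date: {today.isoformat()}"  # assumed format
    system_prompt = f"{template}\n\n{date_context}"  # full prompt when uncached
    return cache_system_prompt, date_context, system_prompt


monday = split_post_call_prompt("Summarize the call.", date(2025, 1, 6))
tuesday = split_post_call_prompt("Summarize the call.", date(2025, 1, 7))
```

The cacheable text is identical on both days, while the fully rendered prompt differs, which is why only the former is used for the cache key.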

VoxBridge computes the version from the raw analysis/QC prompt templates, not from rendered per-call values. Invalidation is version-based, never delete-based: bump the version, the registry key changes, and the next call creates a fresh CachedContent.

OpenAI

Post-call context caching is not supported on OpenAI. When cache_enabled is set for a post-call pass that runs on OpenAI, the usage cache metadata reports status unsupported_provider with reason openai_post_call_cache_not_supported (_run_openai).

Stale retry

If a cached generate call fails because the cache has expired, the post-call path evicts the registry entry and retries once with system_instruction sent inline (no cache). The usage cache metadata records status stale_retry.
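The retry-once behaviour might be sketched like this. CacheExpiredError, the generate and evict callbacks, and the initial hit status are assumptions made for the example; the real path in post_call.py may detect the failure differently.

```python
from typing import Any, Callable, Dict, Tuple


class CacheExpiredError(RuntimeError):
    """Hypothetical stand-in for the provider error raised on an expired cache."""


def generate_with_stale_retry(generate: Callable[..., Any],
                              evict: Callable[[str], None],
                              cache_name: str,
                              system_instruction: str) -> Tuple[Any, Dict[str, Any]]:
    """Try the cached call once; on expiry, evict and retry inline once."""
    cache_meta: Dict[str, Any] = {"enabled": True, "status": "hit"}
    try:
        result = generate(cached_content=cache_name, system_instruction=None)
    except CacheExpiredError:
        evict(cache_name)
        cache_meta["status"] = "stale_retry"
        # The retry sends the system instruction inline and uses no cache.
        result = generate(cached_content=None, system_instruction=system_instruction)
    return result, cache_meta
```

A failure on the inline retry is not caught, so it propagates to the caller instead of looping.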

Visibility

Every LLM usage entry carries a cache dict (UsageEntry.cache in src/voxcore/models/results.py) with enabled, status, namespace, version, reason.

Namespace	Where attached
`live_prompt`	First live `llm` usage entry only
`post_call_analysis`	Post-call analysis pass
`qc_analysis`	QC pass
`callback_extraction`	Callback-detection pass (caching disabled)

Cache status values include disabled, hit, created, fallback, ineligible, unsupported_provider, stale_retry.

OpenAI live hit/miss is derived only from real cache_read_input_tokens (hit when > 0, miss when 0). Do not invent estimated tokens or money saved. Gemini/Vertex cache-creation tokens come from the cache’s usage_metadata.
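As a sketch, the OpenAI hit/miss rule reduces to a single comparison on the reported token count. The function name and the None value for reason are assumptions; the five dict keys and the live_prompt namespace are the ones listed in this section.

```python
from typing import Any, Dict


def openai_live_cache_metadata(cache_read_input_tokens: int, version: str) -> Dict[str, Any]:
    """Derive status only from the provider-reported cached token count."""
    status = "hit" if cache_read_input_tokens > 0 else "miss"
    return {
        "enabled": True,
        "status": status,
        "namespace": "live_prompt",
        "version": version,
        "reason": None,  # assumed empty when nothing went wrong
    }
```

No estimate is computed anywhere: a count of zero is reported as a miss rather than being replaced with a guess.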

VoxBridge’s build_llm_cache_summary (services/call_service.py) is scoped to post-call/QC only (type in {"post_call", "qc"}) so dashboard aggregate cards are not polluted by live-conversation cache data. The summary reports token facts only, never money. VoxUI Call Detail renders separate Live Prompt Cache and Post-Call Cache sections.
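The scoping rule could be sketched as a filter followed by token-only aggregation. This is not the real build_llm_cache_summary: the function name, the entry shape, and the cached_tokens field are hypothetical, and only the type filter and the tokens-not-money rule come from the paragraph above.

```python
from collections import Counter
from typing import Any, Dict, List

SUMMARY_TYPES = {"post_call", "qc"}


def summarize_cache_usage(entries: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate cache facts for post-call and QC entries only."""
    scoped = [e for e in entries if e.get("type") in SUMMARY_TYPES]
    statuses = Counter(e.get("cache", {}).get("status", "unknown") for e in scoped)
    cached_tokens = sum(int(e.get("cached_tokens", 0)) for e in scoped)
    # Token facts only: no cost or savings figure is derived here.
    return {
        "entries": len(scoped),
        "status_counts": dict(statuses),
        "cached_tokens": cached_tokens,
    }


sample_entries = [
    {"type": "llm", "cache": {"status": "hit"}, "cached_tokens": 9000},
    {"type": "post_call", "cache": {"status": "hit"}, "cached_tokens": 1200},
    {"type": "qc", "cache": {"status": "stale_retry"}, "cached_tokens": 0},
]
summary = summarize_cache_usage(sample_entries)
```

The live llm entry is excluded, so its 9000 cached tokens never reach the aggregate.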

Related

Bot configuration

prompt_parts, post_call_cache, and related fields.

Pipeline

Where the LLM service and live cache are wired into the pipeline.
