Live prompt cache
Caches the static system prompt + tools used during the live call. Cuts per-turn latency and input-token cost.
Post-call context cache
Caches the analysis/QC system prompt for post-call LLM passes. Cuts post-call input-token cost.
CachedContent. OpenAI uses request-level hints instead of the Gemini cache registry.
Why
A bot’s system prompt (policy, compliance rules, objection handling, tool specs) is large and identical across every turn and every call for that bot version. Re-sending it each turn wastes input tokens and adds latency. Caching the static block lets the provider charge the cheaper cached-input rate and skip re-tokenizing it.
Prompt split
The split is configured per bot via PromptPartsConfig (prompt_parts in bot config, src/voxcore/models/config.py):
| Field | Purpose |
|---|---|
| mode | legacy (no split, no cache) or direct (split prompt, cache eligible) |
| cache_enabled | Master toggle for live prompt caching on this bot |
| static_system_prompt | The cacheable block: policy, compliance, flow, tool-usage rules, disposition schema |
| dynamic_runtime_prompt | Fresh per-call block: customer variables, CRM fields, dates, attempt data |
| static_version | Cache version; bump to invalidate when the static prompt changes |
| openai_prompt_cache_key | OpenAI cache routing key (OpenAI only) |
| openai_prompt_cache_retention | in_memory or 24h (OpenAI only) |
The live cache is active only when mode == "direct" and cache_enabled is true (src/voxcore/pipeline/factory.py).
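The eligibility rule can be sketched roughly as follows. The dataclass is an illustrative stand-in for PromptPartsConfig, not the model in config.py; its defaults and the helper names live_cache_active and select_prompt_blocks are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PromptPartsSketch:
    """Illustrative stand-in for PromptPartsConfig; defaults are assumptions."""

    mode: str = "legacy"                 # "legacy" or "direct"
    cache_enabled: bool = False          # master toggle for the live cache
    static_system_prompt: str = ""       # cacheable block
    dynamic_runtime_prompt: str = ""     # fresh per-call block
    static_version: str = "1"            # bump to invalidate
    openai_prompt_cache_key: Optional[str] = None
    openai_prompt_cache_retention: Optional[str] = None  # "in_memory" or "24h"


def live_cache_active(parts: PromptPartsSketch) -> bool:
    """The live cache applies only to direct-mode bots with the toggle on."""
    return parts.mode == "direct" and parts.cache_enabled


def select_prompt_blocks(parts: PromptPartsSketch, legacy_system_prompt: str) -> List[str]:
    """Return the static and dynamic blocks when caching, else the legacy prompt."""
    if live_cache_active(parts):
        return [parts.static_system_prompt, parts.dynamic_runtime_prompt]
    return [legacy_system_prompt]
```

A bot with mode set to direct but cache_enabled left false falls through to the legacy prompt in this sketch, which matches the two-condition rule above.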
Direct-prompt bots carry both system_prompt and static_system_prompt. When the live cache is active, VoxCore builds the live LLM context from the static block plus dynamic_runtime_prompt, not the legacy system_prompt. Keep both in sync.
Live prompt cache
Gemini / Vertex
For google and google_vertex providers, VoxCore resolves a CachedContent name and injects it before the LLM service is created (src/voxcore/pipeline/factory.py).
When cached_content is active, the context is built with tools=NOT_GIVEN and system_instruction is omitted per-request, because Gemini requires that tools and the system instruction live inside the cached content, not in the request.
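A minimal sketch of that rule, assuming the request context is a plain dict. The NOT_GIVEN object below is a local sentinel defined only for this example, and build_gemini_context is a hypothetical helper, not a function from factory.py.

```python
from typing import Any, Dict, List, Optional

# Local sentinel standing in for the NOT_GIVEN marker named above (assumption).
NOT_GIVEN = object()


def build_gemini_context(
    cached_content: Optional[str],
    system_instruction: str,
    tools: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Return per-request context fields, leaving out what the cache already holds."""
    if cached_content:
        # Tools and the system instruction live inside the cached content,
        # so neither is sent with the request.
        return {"cached_content": cached_content, "tools": NOT_GIVEN}
    # No cache: send both inline as usual.
    return {"system_instruction": system_instruction, "tools": tools}
```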
OpenAI
OpenAI does not use the Gemini registry. When the bot is direct-mode + cache-enabled and openai_prompt_cache_key is set, VoxCore passes request hints (src/voxcore/pipeline/llm_factory.py):
- prompt_cache_key: routing key for the provider’s automatic prefix cache
- prompt_cache_retention: in_memory or 24h

No CachedContent is created; VoxCore relies on the provider’s automatic prefix caching and its reported cached-token usage.
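The OpenAI hint selection might look roughly like this. The function name openai_cache_hints is hypothetical, and whether the real code validates the retention value before passing it on is an assumption of this sketch.

```python
from typing import Dict, Optional

ALLOWED_RETENTION = ("in_memory", "24h")


def openai_cache_hints(
    mode: str,
    cache_enabled: bool,
    cache_key: Optional[str],
    retention: Optional[str] = None,
) -> Dict[str, str]:
    """Return extra request parameters, or an empty dict when hints do not apply."""
    if mode != "direct" or not cache_enabled or not cache_key:
        return {}
    hints = {"prompt_cache_key": cache_key}
    # Only forward a retention value from the documented set (assumed behaviour).
    if retention in ALLOWED_RETENTION:
        hints["prompt_cache_retention"] = retention
    return hints
```

Returning an empty dict keeps the call site simple: the hints can be merged into the request parameters unconditionally.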
Cache resolution order
get_or_create_live_prompt_cache (src/voxcore/pipeline/live_prompt_cache.py) resolves the cache name in this order:
1. Prewarmed name. If VoxBridge supplied a prewarmed live_prompt_cache_state.cache_name, it short-circuits everything (no registry, Redis, or Vertex create). Status hit, source prewarmed.
2. In-process registry. A bounded OrderedDict (max 512 entries) keyed by provider + client identity + model + prompt hash + version + tools hash. A fresh entry is returned directly.
3. Redis. Shared across all 16 workers via voxcore:live_prompt_cache: keys. A hit is promoted into the local registry so siblings skip creation.

Entries are refreshed once they reach 90% of their TTL (_LIVE_PROMPT_CACHE_REFRESH_RATIO = 0.9); least-recently-used entries are evicted when the registry is full.
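The in-process registry described above can be approximated as follows. This is a sketch under stated assumptions, not the code in live_prompt_cache.py: the class and function names, the use of SHA-256 for the hashes, and the injectable clock are all choices made for the example.

```python
import hashlib
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple

MAX_ENTRIES = 512
REFRESH_RATIO = 0.9  # mirrors _LIVE_PROMPT_CACHE_REFRESH_RATIO

Key = Tuple[str, ...]


def registry_key(provider: str, client_id: str, model: str,
                 static_prompt: str, version: str, tools_json: str) -> Key:
    """Build a key from the parts listed above; hashing scheme is an assumption."""
    prompt_hash = hashlib.sha256(static_prompt.encode("utf-8")).hexdigest()
    tools_hash = hashlib.sha256(tools_json.encode("utf-8")).hexdigest()
    return (provider, client_id, model, prompt_hash, version, tools_hash)


class BoundedRegistry:
    """LRU-bounded map of key -> (cache name, created-at, TTL in seconds)."""

    def __init__(self, max_entries: int = MAX_ENTRIES,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._entries: "OrderedDict[Key, Tuple[str, float, float]]" = OrderedDict()
        self._max = max_entries
        self._clock = clock

    def put(self, key: Key, cache_name: str, ttl_seconds: float) -> None:
        self._entries[key] = (cache_name, self._clock(), ttl_seconds)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used

    def get_fresh(self, key: Key) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        cache_name, created_at, ttl = entry
        if self._clock() - created_at >= ttl * REFRESH_RATIO:
            return None  # near TTL: the caller should fall through and refresh
        self._entries.move_to_end(key)  # mark as recently used
        return cache_name
```

Because the version is part of the key, changing it produces a different key and the old entry is simply never looked up again.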
Lifecycle ownership
VoxBridge owns the scheduled lifecycle; VoxCore is a consumer that prefers VoxBridge’s prewarmed name and self-heals when it expires.
| Responsibility | Owner | Reference |
|---|---|---|
| 7am prewarm (default prewarm_hour=7) | VoxBridge | services/live_prompt_cache_lifecycle.py |
| 11pm cleanup (default cleanup_hour=23, disabled bots only) | VoxBridge | services/live_prompt_cache_lifecycle.py |
| 10h TTL (default ttl_hours=10) | VoxBridge | services/live_prompt_cache_lifecycle.py |
| Missed-run catch-up | VoxBridge | _catch_up_missed_runs |
| Inline single-flight recreate | VoxBridge | routes/internal_live_cache.py (/recreate) |
| Audit events | VoxBridge | services/live_prompt_cache_audit.py |
| Prefer prewarmed name, in-process registry, Redis sharing, near-TTL refresh | VoxCore | pipeline/live_prompt_cache.py |
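To make the default schedule concrete, here is a small worked calculation. It assumes the TTL clock starts at the prewarm time and that all hours are in the same local timezone; neither is stated above, and daily_cache_window is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Dict

PREWARM_HOUR = 7    # default prewarm_hour
CLEANUP_HOUR = 23   # default cleanup_hour
TTL_HOURS = 10      # default ttl_hours


def daily_cache_window(day: datetime) -> Dict[str, datetime]:
    """Compute prewarm, expiry, and cleanup times for the given day."""
    midnight = day.replace(hour=0, minute=0, second=0, microsecond=0)
    prewarm_at = midnight + timedelta(hours=PREWARM_HOUR)
    return {
        "prewarm_at": prewarm_at,
        # Assumption: the TTL is counted from the prewarm time.
        "expires_at": prewarm_at + timedelta(hours=TTL_HOURS),
        "cleanup_at": midnight + timedelta(hours=CLEANUP_HOUR),
    }
```

Under these assumptions a cache prewarmed at 07:00 lapses at 17:00, so calls later in the day would depend on the recreate and self-heal paths rather than on the morning prewarm.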
Cache refresh is version-based, not delete-based. To force a refresh, bump static_version (which changes the registry key) and invalidate the VoxBridge bot-config cache. Do not try to delete Gemini caches across all workers; old caches age out by TTL. Cleanup deliberately targets disabled bots only, because deleting an enabled bot’s cache nightly created a dead window between cleanup and the next prewarm.
Cache-expiry recovery
A cached content name can become unusable mid-call — TTL expiry (400 INVALID_ARGUMENT ... is expired) or deletion/aging-out (404 ... cached content metadata ... not found). Both are recoverable. GoogleLLMServiceWithCacheRecovery / GoogleVertexLLMServiceWithCacheRecovery (src/voxcore/pipeline/google_llm_with_cache_recovery.py) handle it:
1. Detect during iteration. The Gemini stream is lazy: the error surfaces while iterating the response, not when the stream is awaited. VoxCore wraps the iteration itself so the error is caught.
2. Fetch a replacement. Reads the current prewarmed name from VoxBridge (live_prompt_cache_state.cache_name); if missing or unchanged, POSTs to VoxBridge /recreate for an inline single-flight create.

Audit events (expired_in_call, swap_after_expiry) are posted to VoxBridge fire-and-forget.
Post-call context cache
Post-call analysis, QC, and callback extraction share the same registry/Redis pattern in src/voxcore/processors/post_call.py (voxcore:post_call_cache: keys, max 512 entries, 90% TTL refresh).
Configuration
Enable per bot via either the legacy flat fields or the nested post_call_cache dict (_post_call.py reads both):
| Field | Purpose |
|---|---|
| post_call_cache_enabled | Legacy flat toggle |
| post_call_cache_version | Legacy flat version |
| post_call_cache.enabled | Nested toggle (overrides flat) |
| post_call_cache.analysis_version | Per-namespace version for analysis |
| post_call_cache.qc_version | Per-namespace version for QC |
The cached system_instruction is the prompt without the injected per-call date context. The date block is sent as fresh content so the cache key stays stable across calls (cache_system_prompt vs system_prompt in post_call.py).
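A short sketch of why the date block is kept out of the cached text. The helper name, the date wording, and the use of SHA-256 as the prompt hash are assumptions for illustration only.

```python
import hashlib
from datetime import date
from typing import Dict


def split_post_call_prompt(template: str, today: date) -> Dict[str, str]:
    """Separate the cacheable prompt from the per-call date context."""
    date_block = f"Today's date is {today.isoformat()}."  # hypothetical wording
    return {
        "cache_system_prompt": template,  # cached; contains no per-call values
        "date_block": date_block,         # sent as fresh content on each call
        "cache_hash": hashlib.sha256(template.encode("utf-8")).hexdigest(),
    }
```

Two calls on different days produce the same cache_hash in this sketch, so both would resolve to the same cached content; had the date been folded into the template, every day would produce a new key.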
VoxBridge computes the version from raw analysis/QC prompt templates, not rendered per-call values. Invalidation is version-based, never delete-based: bump the version, the registry key changes, and the next call creates a fresh CachedContent.
OpenAI
Post-call context caching is not supported on OpenAI. When cache_enabled is set on an OpenAI post-call request, the usage cache metadata reports status unsupported_provider with reason openai_post_call_cache_not_supported (_run_openai).
Stale retry
If a cached generate call fails because the cache has expired, the post-call path evicts the registry entry and retries once with system_instruction inline (no cache). The usage cache metadata records status stale_retry.
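The evict-and-retry-once behaviour could be sketched like this. StaleCacheError, the generate callable's signature, and the plain dict used as the registry are all assumptions made for the example; the status strings come from the text above.

```python
from typing import Callable, Dict, Optional, Tuple


class StaleCacheError(Exception):
    """Hypothetical stand-in for an expired-cache failure from the provider."""


# generate(cache_name, system_instruction) -> response text (assumed shape)
GenerateFn = Callable[[Optional[str], Optional[str]], str]


def generate_with_stale_retry(
    generate: GenerateFn,
    registry: Dict[str, str],
    key: str,
    system_instruction: str,
) -> Tuple[str, str]:
    """Return (text, cache_status); retry once without the cache if it is stale."""
    cache_name = registry[key]
    try:
        return generate(cache_name, None), "hit"
    except StaleCacheError:
        # Evict so later calls do not reuse the dead name, then retry inline.
        registry.pop(key, None)
        return generate(None, system_instruction), "stale_retry"
```

The retry is deliberately not wrapped in another try block, so a second failure propagates to the caller instead of looping.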
Visibility
Every LLM usage entry carries a cache dict (UsageEntry.cache in src/voxcore/models/results.py) with the keys enabled, status, namespace, version, and reason.
| Namespace | Where attached |
|---|---|
| live_prompt | First live llm usage entry only |
| post_call_analysis | Post-call analysis pass |
| qc_analysis | QC pass |
| callback_extraction | Callback-detection pass (caching disabled) |
status values include disabled, hit, created, fallback, ineligible, unsupported_provider, stale_retry.
VoxBridge’s build_llm_cache_summary (services/call_service.py) is scoped to post-call/QC only (type in {"post_call", "qc"}) so dashboard aggregate cards are not polluted by live-conversation cache data. The summary reports token facts only, never money. VoxUI Call Detail renders separate Live Prompt Cache and Post-Call Cache sections.
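The scoping rule can be illustrated with a simplified aggregation. This is not build_llm_cache_summary itself: the entry layout, the cached_tokens field name, and the shape of the returned summary are assumptions for the sketch.

```python
from collections import Counter
from typing import Any, Dict, Iterable

SUMMARY_TYPES = {"post_call", "qc"}


def summarize_post_call_cache(entries: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate cache facts for post-call and QC usage entries only."""
    statuses: Counter = Counter()
    cached_tokens = 0
    for entry in entries:
        if entry.get("type") not in SUMMARY_TYPES:
            continue  # live-conversation entries stay out of the summary
        cache = entry.get("cache") or {}
        statuses[cache.get("status", "disabled")] += 1
        # "cached_tokens" is a hypothetical field name; token counts only, no cost.
        cached_tokens += int(entry.get("cached_tokens", 0))
    return {"statuses": dict(statuses), "cached_tokens": cached_tokens}
```

A live entry with a large cached-token count leaves the result unchanged, which is the point of the type filter.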
Related
Bot configuration
prompt_parts, post_call_cache, and related fields.
Pipeline
Where the LLM service and live cache are wired into the pipeline.