Overview

GoModel exposes the OpenAI-compatible audio endpoints for text-to-speech (TTS) and speech-to-text (STT). Clients and SDKs that already call OpenAI’s /v1/audio/* routes can point at GoModel unchanged. Requests route by model through the same registry used for chat and embeddings, so model selection, provider hints, model aliases, per-key model access rules (user paths), and budgets all apply. Audio is served by OpenAI and the OpenAI-compatible providers (OpenRouter, Azure OpenAI, vLLM, Oracle, MiniMax, Z.ai); a provider that doesn’t support audio returns a clear error rather than mis-routing.

Supported endpoints

Endpoint | Behavior
POST /v1/audio/speech | Text-to-speech. Accepts a JSON body and returns binary audio in the requested response_format.
POST /v1/audio/transcriptions | Speech-to-text. Accepts a multipart/form-data upload and returns JSON or plain text per response_format.

Text-to-speech

curl https://your-gateway/v1/audio/speech \
  -H "Authorization: Bearer $GOMODEL_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini-tts",
    "input": "Hello from GoModel.",
    "voice": "alloy",
    "response_format": "wav"
  }' \
  --output speech.wav
model, input, and voice are required. The optional fields instructions, response_format (mp3 by default; opus, aac, flac, wav, and pcm are also accepted), and speed are forwarded to the provider. The response Content-Type is derived from response_format (for example, wav → audio/wav).
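
The same request can be assembled from a script. The following is a minimal sketch using only the Python standard library. It assumes the placeholder gateway URL and key from the curl example, and build_speech_request is a hypothetical helper name, not part of GoModel. It checks the three required fields on the client side and passes any optional fields through untouched.

```python
import json
import urllib.request

# Required by the endpoint, per the text above.
REQUIRED_FIELDS = ("model", "input", "voice")


def build_speech_request(base_url, api_key, **fields):
    """Build (but do not send) a POST /v1/audio/speech request.

    Hypothetical helper: optional fields such as instructions,
    response_format, and speed are passed through unchanged.
    """
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(fields).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Mirrors the curl example; the URL and key are placeholders.
request = build_speech_request(
    "https://your-gateway",
    "YOUR_GOMODEL_KEY",
    model="gpt-4o-mini-tts",
    input="Hello from GoModel.",
    voice="alloy",
    response_format="wav",
)
```

To send it, pass request to urllib.request.urlopen and write the bytes returned by read() to a file such as speech.wav. urlopen raises urllib.error.HTTPError for error responses, so a provider that does not support audio surfaces as an exception rather than as a corrupt audio file.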

Speech-to-text

curl https://your-gateway/v1/audio/transcriptions \
  -H "Authorization: Bearer $GOMODEL_KEY" \
  -F "file=@speech.wav" \
  -F "model=gpt-4o-transcribe" \
  -F "response_format=json"
file and model are required. The optional form fields language, prompt, response_format, temperature, and timestamp_granularities[] are forwarded to the provider. response_format controls the response shape: json and verbose_json return a JSON object; text, srt, and vtt return a text/plain body.
The bracketed timestamp_granularities[] form key is canonical, but GoModel also accepts the unbracketed timestamp_granularities for client compatibility.
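
Because the response shape depends on response_format, a client has to decode the body accordingly. The sketch below shows one way to do that. parse_transcription is a hypothetical helper name, and decoding the text/plain body as UTF-8 is an assumption rather than something the gateway documents.

```python
import json

# Formats that return a JSON object versus a text/plain body, per the text above.
JSON_FORMATS = {"json", "verbose_json"}
TEXT_FORMATS = {"text", "srt", "vtt"}


def parse_transcription(body: bytes, response_format: str):
    """Decode a /v1/audio/transcriptions response body.

    Returns a dict for the JSON formats and a str for the text formats.
    """
    if response_format in JSON_FORMATS:
        return json.loads(body)
    if response_format in TEXT_FORMATS:
        # Assumption: the text/plain body is UTF-8 encoded.
        return body.decode("utf-8")
    raise ValueError(f"unexpected response_format: {response_format}")
```

Pass the same response_format value that was sent in the form, since the body alone does not reliably identify which shape was returned.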

Limitations

The audio endpoints are a thin, model-routed pass-through to the provider and do not run through the full inference orchestrator. Compared with /v1/chat/completions:
  • No failover, guardrails, or response cache — these stages are skipped.
  • No usage/cost metering — audio is not token-priced, so it is not recorded in usage tracking. Requests are still authorized, budget-checked, and written to the audit log under their /v1/audio/* path.
  • OpenAI request shape only — requests are forwarded in OpenAI’s audio format to OpenAI-compatible upstreams. Providers with a different native audio contract are not yet adapted behind this endpoint.
  • Realtime voice-to-voice (the WebSocket realtime API) is not supported.
For a provider whose native audio API differs from OpenAI’s, use the passthrough API (/p/{provider}/v1/audio/...) to forward bytes verbatim to that upstream.

Audit logging

Audio requests appear in the audit log like any other model interaction. Because audio payloads are binary and large, their bodies are gated by a dedicated setting, LOGGING_LOG_AUDIO_BODIES (default false), which refines LOGGING_LOG_BODIES — it has no effect unless body logging is enabled:
  • Body logging off (LOGGING_LOG_BODIES=false): no audio body is stored, regardless of this setting.
  • Body logging on, audio off (the default): the audio response is recorded as a lightweight {__audio__, content_type, bytes, stored: false} placeholder; no audio bytes are stored.
  • Body logging on, audio on: /v1/audio/speech stores its text input and the generated audio (base64, capped at 8 MB) so the dashboard renders an inline player, and /v1/audio/transcriptions stores the uploaded audio (base64, capped at 8 MB, also playable in the dashboard) alongside the upload metadata (filename, model, params).
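
The interaction between the two settings can be summarized as a small decision function. This is a reader-side sketch of the rules listed above, not GoModel's implementation. The function name and the returned labels ("none", "placeholder", "stored") are illustrative, and the 8 MB cap is not modeled.

```python
def audio_body_mode(log_bodies: bool, log_audio_bodies: bool = False) -> str:
    """Return how an audio body would appear in the audit log.

    log_bodies corresponds to LOGGING_LOG_BODIES and log_audio_bodies to
    LOGGING_LOG_AUDIO_BODIES (default false). The labels are illustrative.
    """
    if not log_bodies:
        # Body logging off: the audio setting has no effect.
        return "none"
    if not log_audio_bodies:
        # Only a lightweight placeholder is recorded; no audio bytes.
        return "placeholder"
    # Audio is stored base64-encoded.
    return "stored"
```

For example, audio_body_mode(False, True) returns "none", which reflects that LOGGING_LOG_AUDIO_BODIES only refines LOGGING_LOG_BODIES and cannot enable body logging on its own.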