Audio (Transcription, Translation & Speech)
The gateway provides three audio endpoints: transcription (speech-to-text), translation (speech-to-English), and speech synthesis (text-to-speech).
Transcriptions
Transcribe audio into text in the original language.
POST /v1/audio/transcriptions
Required capability: audio
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Transcription model (e.g. whisper-1). |
| file | string | Conditional | Base64-encoded audio data. Either file or file_url is required. |
| file_url | string | Conditional | URL to the audio file. Either file or file_url is required. |
| language | string | No | ISO 639-1 language code (e.g. en, fr, de). Improves accuracy when specified. |
| prompt | string | No | Optional text to guide the transcription style or provide context. |
| response_format | string | No | Output format: "json" (default), "text", "srt", "verbose_json", or "vtt". |
| temperature | number | No | Sampling temperature between 0 and 1. |
Request size limit: 10 MB
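The table documents file as Base64-encoded audio, but the curl example for this endpoint only uses file_url. As a minimal sketch of the inline variant, the Python below builds a request body from raw audio bytes and checks it against the size limit. The helper name build_transcription_body is hypothetical. Reading "10 MB" as 10 * 1024 * 1024 bytes of JSON body is an assumption, since the gateway may measure the limit differently.

```python
import base64
import json

# Assumption: the documented 10 MB limit applies to the JSON request body
# and means 10 * 1024 * 1024 bytes.
MAX_REQUEST_BYTES = 10 * 1024 * 1024


def build_transcription_body(audio_bytes, model="whisper-1", language=None):
    """Return a JSON string for POST /v1/audio/transcriptions using `file`."""
    body = {
        "model": model,
        # Base64 inflates data by about one third, so roughly 7.5 MB of
        # raw audio is the most that fits under a 10 MB body.
        "file": base64.b64encode(audio_bytes).decode("ascii"),
    }
    if language is not None:
        body["language"] = language
    encoded = json.dumps(body)
    if len(encoded.encode("utf-8")) > MAX_REQUEST_BYTES:
        raise ValueError("request body exceeds the 10 MB limit; consider file_url")
    return encoded
```

Because of the Base64 overhead, larger recordings are probably better passed by file_url, which keeps the request body small.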
Response
{
  "text": "The quick brown fox jumps over the lazy dog."
}
For the verbose_json format, the response includes additional fields, such as word-level timestamps.
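Since response_format changes what comes back, a client usually needs to branch on it. The sketch below shows one way to do that in Python. The function name extract_transcript is hypothetical, and the handling of "text", "srt", and "vtt" as plain-text bodies is an assumption: this page only documents the JSON shape.

```python
import json

JSON_FORMATS = {"json", "verbose_json"}
TEXT_FORMATS = {"text", "srt", "vtt"}


def extract_transcript(response_format, raw_body):
    """Return the transcript from a response body given as a str."""
    if response_format in JSON_FORMATS:
        # Documented shape: {"text": "..."}; verbose_json adds more fields.
        return json.loads(raw_body)["text"]
    if response_format in TEXT_FORMATS:
        # Assumption: these formats come back as a plain-text body.
        return raw_body
    raise ValueError(f"unknown response_format: {response_format!r}")
```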
Example
curl https://your-gateway.example.com/v1/audio/transcriptions \
  -H "Authorization: Bearer aigw_sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "whisper-1",
    "file_url": "https://example.com/audio/recording.mp3",
    "language": "en",
    "response_format": "json"
  }'
Translations
Translate audio into English text.
POST /v1/audio/translations
Required capability: audio
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Translation model (e.g. whisper-1). |
| file | string | Conditional | Base64-encoded audio data. Either file or file_url is required. |
| file_url | string | Conditional | URL to the audio file. Either file or file_url is required. |
| prompt | string | No | Optional text to guide the translation. |
| response_format | string | No | Output format: "json" (default), "text", "srt", "verbose_json", or "vtt". |
| temperature | number | No | Sampling temperature between 0 and 1. |
Request size limit: 10 MB
Response
{
  "text": "The translated text in English."
}
Example
curl https://your-gateway.example.com/v1/audio/translations \
  -H "Authorization: Bearer aigw_sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "whisper-1",
    "file_url": "https://example.com/audio/french-speech.mp3",
    "response_format": "json"
  }'
Speech (Text-to-Speech)
Synthesize speech from text input.
POST /v1/audio/speech
Required capability: tts
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | TTS model (e.g. tts-1, tts-1-hd). |
| input | string | Yes | The text to synthesize (1-4096 characters). |
| voice | string | Yes | Voice to use (e.g. alloy, echo, fable, onyx, nova, shimmer). |
| response_format | string | No | Audio format: "mp3" (default), "opus", "aac", "flac", "wav", or "pcm". |
| speed | number | No | Playback speed from 0.25 to 4.0. Defaults to 1.0. |
Request size limit: 1 MB
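The documented ranges for input, speed, and response_format can be checked on the client before a request is sent. The Python below is a minimal sketch of such a check. The helper name build_speech_body is hypothetical, and the gateway's own validation may differ. The voice value is passed through unchecked because the table lists voices only as examples.

```python
import json

ALLOWED_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def build_speech_body(text, voice, model="tts-1", response_format="mp3", speed=1.0):
    """Validate against the documented limits and return a JSON string."""
    if not 1 <= len(text) <= 4096:
        raise ValueError("input must be 1-4096 characters")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    return json.dumps({
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    })
```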
Response
The response body is the raw audio data with the appropriate Content-Type header (e.g. audio/mpeg for MP3). This is a binary response, not JSON.
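Because the body is binary, client code should read bytes and write them in binary mode instead of decoding JSON. The following Python sketch uses only the standard library. The function names make_speech_request and save_audio are hypothetical, and the base URL and API key are the placeholders used elsewhere on this page.

```python
import json
import urllib.request


def make_speech_request(base_url, api_key, body):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_audio(request, path):
    """Send the request and write the raw response bytes to `path`."""
    with urllib.request.urlopen(request) as response:
        # The body is binary audio, so read bytes; do not parse it as JSON.
        audio = response.read()
        content_type = response.headers.get("Content-Type")
    with open(path, "wb") as f:
        f.write(audio)
    # For example "audio/mpeg" when response_format is "mp3".
    return content_type
```

A call such as save_audio(make_speech_request("https://your-gateway.example.com", "aigw_sk_your_api_key", {"model": "tts-1", "input": "Hello", "voice": "alloy"}), "speech.mp3") would write the audio to disk. urlopen raises urllib.error.HTTPError on non-2xx responses, so failures surface as exceptions.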
Example
curl https://your-gateway.example.com/v1/audio/speech \
  -H "Authorization: Bearer aigw_sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, welcome to the AI Gateway.",
    "voice": "alloy",
    "response_format": "mp3",
    "speed": 1.0
  }' \
  --output speech.mp3