Audio (Transcription, Translation & Speech)

The gateway provides three audio endpoints: transcription (speech to text), translation (speech to English text), and speech synthesis (text to speech).

Transcriptions

Transcribe audio into text in the original language.

POST /v1/audio/transcriptions

Required capability: audio

Request Body

Parameter	Type	Required	Description
`model`	`string`	Yes	Transcription model (e.g. `whisper-1`).
`file`	`string`	Conditional	Base64-encoded audio data. Either `file` or `file_url` is required.
`file_url`	`string`	Conditional	URL to the audio file. Either `file` or `file_url` is required.
`language`	`string`	No	ISO 639-1 language code (e.g. `en`, `fr`, `de`). Improves accuracy when specified.
`prompt`	`string`	No	Optional text to guide the transcription style or provide context.
`response_format`	`string`	No	Output format: `"json"` (default), `"text"`, `"srt"`, `"verbose_json"`, or `"vtt"`.
`temperature`	`number`	No	Sampling temperature between 0 and 1.

Request size limit: 10 MB
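
Because `file` carries Base64-encoded audio inside the JSON body, the encoded payload is about one third larger than the raw audio. The Python sketch below builds a transcription request body from raw bytes and checks its size before sending. It assumes the 10 MB limit applies to the whole request body and means 10 × 1024 × 1024 bytes; this page confirms neither. `build_transcription_body` and `MAX_REQUEST_BYTES` are hypothetical names, not part of the gateway.

```python
import base64
import json

# Assumption: the documented 10 MB limit covers the whole JSON body and
# means 10 * 1024 * 1024 bytes. Adjust if the gateway counts differently.
MAX_REQUEST_BYTES = 10 * 1024 * 1024


def build_transcription_body(audio: bytes, model: str = "whisper-1", **options) -> bytes:
    """Return the JSON request body for inline (Base64) audio.

    Raises ValueError if the encoded body would exceed MAX_REQUEST_BYTES.
    """
    payload = {
        "model": model,
        # Base64 turns every 3 raw bytes into 4 ASCII characters.
        "file": base64.b64encode(audio).decode("ascii"),
    }
    payload.update(options)  # e.g. language="en", response_format="json"
    body = json.dumps(payload).encode("utf-8")
    if len(body) > MAX_REQUEST_BYTES:
        raise ValueError(
            f"request body is {len(body)} bytes, over the {MAX_REQUEST_BYTES}-byte limit"
        )
    return body
```

Under these assumptions, roughly 7.5 MB of raw audio fits in one inline request. For larger recordings, `file_url` may be the better option.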

Response

{
  "text": "The quick brown fox jumps over the lazy dog."
}

When `response_format` is `"verbose_json"`, the response includes additional fields, such as word-level timestamps.

Example

curl https://your-gateway.example.com/v1/audio/transcriptions \
  -H "Authorization: Bearer aigw_sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "whisper-1",
    "file_url": "https://example.com/audio/recording.mp3",
    "language": "en",
    "response_format": "json"
  }'

Translations

Translate audio into English text.

POST /v1/audio/translations

Required capability: audio

Request Body

Parameter	Type	Required	Description
`model`	`string`	Yes	Translation model (e.g. `whisper-1`).
`file`	`string`	Conditional	Base64-encoded audio data. Either `file` or `file_url` is required.
`file_url`	`string`	Conditional	URL to the audio file. Either `file` or `file_url` is required.
`prompt`	`string`	No	Optional text to guide the translation.
`response_format`	`string`	No	Output format: `"json"` (default), `"text"`, `"srt"`, `"verbose_json"`, or `"vtt"`.
`temperature`	`number`	No	Sampling temperature between 0 and 1.

Request size limit: 10 MB
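
The constraints in the table above can be checked on the client before a request is sent. The following Python sketch is one way to do that. `build_translation_body` is a hypothetical helper. Treating the temperature bounds as inclusive and rejecting requests that set both `file` and `file_url` are assumptions, since this page does not specify either behavior.

```python
import json

# Formats listed in the request table above.
TEXT_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}


def build_translation_body(model, file=None, file_url=None, prompt=None,
                           response_format="json", temperature=None) -> bytes:
    """Validate the documented constraints and return the JSON request body."""
    if file is None and file_url is None:
        raise ValueError("either 'file' or 'file_url' is required")
    # Assumption: the page does not say what happens when both are sent,
    # so this sketch refuses the ambiguous case instead of guessing.
    if file is not None and file_url is not None:
        raise ValueError("pass only one of 'file' or 'file_url'")
    if response_format not in TEXT_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    # Assumption: the 0 and 1 bounds are inclusive.
    if temperature is not None and not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")

    payload = {"model": model, "response_format": response_format}
    # Only include optional fields that were actually provided.
    for key, value in (("file", file), ("file_url", file_url),
                       ("prompt", prompt), ("temperature", temperature)):
        if value is not None:
            payload[key] = value
    return json.dumps(payload).encode("utf-8")
```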

Response

{
  "text": "The translated text in English."
}

Example

curl https://your-gateway.example.com/v1/audio/translations \
  -H "Authorization: Bearer aigw_sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "whisper-1",
    "file_url": "https://example.com/audio/french-speech.mp3",
    "response_format": "json"
  }'

Speech (Text-to-Speech)

Synthesize speech from text input.

POST /v1/audio/speech

Required capability: tts

Request Body

Parameter	Type	Required	Description
`model`	`string`	Yes	TTS model (e.g. `tts-1`, `tts-1-hd`).
`input`	`string`	Yes	The text to synthesize (1 to 4096 characters).
`voice`	`string`	Yes	Voice to use (e.g. `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`).
`response_format`	`string`	No	Audio format: `"mp3"` (default), `"opus"`, `"aac"`, `"flac"`, `"wav"`, or `"pcm"`.
`speed`	`number`	No	Playback speed from 0.25 to 4.0. Defaults to 1.0.

Request size limit: 1 MB
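
The limits on `input`, `response_format`, and `speed` can be enforced before calling the endpoint. The Python sketch below shows one possible check. `build_speech_body` is a hypothetical helper. It assumes "characters" means Unicode code points (what Python's `len` counts) and that the `speed` bounds are inclusive; the gateway may count or round differently. `voice` is not validated because the table lists its values only as examples.

```python
import json

# Audio formats listed in the request table above.
SPEECH_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def build_speech_body(model: str, text: str, voice: str,
                      response_format: str = "mp3", speed: float = 1.0) -> bytes:
    """Validate the documented limits and return the JSON request body."""
    # Assumption: the 1 to 4096 limit is counted in Unicode code points.
    if not 1 <= len(text) <= 4096:
        raise ValueError("input must be 1 to 4096 characters long")
    if response_format not in SPEECH_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    # Assumption: both speed bounds are inclusive.
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return json.dumps(payload).encode("utf-8")
```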

Response

The response body is raw binary audio, not JSON. The `Content-Type` header matches the returned format (e.g. `audio/mpeg` for MP3).
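
Since the body is binary, a client should write it to disk in binary mode and not try to decode it as text. The Python sketch below uses only the standard library. `speech_request` and `save_audio` are hypothetical helpers, and the base URL and key are placeholders. Treating any non-`audio/*` content type as a failure is an assumption; this page does not describe error responses.

```python
import urllib.request


def speech_request(base_url: str, api_key: str, body: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/audio/speech."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_audio(content_type: str, data: bytes, path: str) -> int:
    """Write raw audio bytes to path and return the number of bytes written."""
    # Assumption: anything that is not audio/* is an unexpected response.
    if not content_type.lower().startswith("audio/"):
        raise ValueError(f"expected audio, got Content-Type {content_type!r}")
    with open(path, "wb") as fh:  # binary mode: no text decoding
        return fh.write(data)


# Sending the request (not executed here):
# req = speech_request("https://your-gateway.example.com", "aigw_sk_your_api_key", body)
# with urllib.request.urlopen(req) as resp:
#     save_audio(resp.headers.get("Content-Type", ""), resp.read(), "speech.mp3")
```

Note that `urllib.request.urlopen` raises `HTTPError` for 4xx and 5xx statuses, so the content-type check mainly guards against an unexpected successful response.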

Example

curl https://your-gateway.example.com/v1/audio/speech \
  -H "Authorization: Bearer aigw_sk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, welcome to the AI Gateway.",
    "voice": "alloy",
    "response_format": "mp3",
    "speed": 1.0
  }' \
  --output speech.mp3