VModel/whisper
Convert speech in audio to text.
Output: $0.01 per use, or 100 uses for $1
Input
audio * audio
The audio file to transcribe.
transcription enum
The format of the transcription output. Default: plain text
translate boolean
Whether to translate the speech to English. Default: false
language enum
Language spoken in the audio; specify 'auto' for automatic language detection.
temperature float
Temperature to use for sampling. Default: 0
patience float
Optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424; the default (1.0) is equivalent to conventional beam search.
suppress_tokens string
Comma-separated list of token IDs to suppress during sampling; '-1' will suppress most special characters except common punctuation. Default: -1
initial_prompt string
Optional text to provide as a prompt for the first window.
condition_on_previous_text boolean
If true, provide the previous output of the model as a prompt for the next window. Disabling this may make the text inconsistent across windows, but makes the model less prone to getting stuck in a failure loop. Default: true
temperature_increment_on_fallback float
Temperature to increase by when decoding fails to meet either of the thresholds below. Default: 0.2
compression_ratio_threshold float
If the gzip compression ratio is higher than this value, treat the decoding as failed. Default: 2.4
logprob_threshold float
If the average log probability is lower than this value, treat the decoding as failed. Default: -1
no_speech_threshold float
If the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence. Default: 0.6
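The parameters above can be assembled into an input payload like the following sketch. The audio URL is a placeholder, not a real file; only `audio` is required, and the remaining fields are shown with the defaults documented above.

```python
# Sketch of an input payload for vmodel/whisper. The audio URL is a
# placeholder (assumption); only "audio" is required, and every other
# field is set to its documented default.
payload = {
    "audio": "https://example.com/speech.mp3",  # required: file to transcribe
    "transcription": "plain text",              # output format
    "translate": False,                         # translate speech to English
    "language": "auto",                         # auto-detect spoken language
    "temperature": 0,                           # sampling temperature
    "patience": 1.0,                            # conventional beam search
    "suppress_tokens": "-1",                    # suppress most special tokens
    "initial_prompt": "",                       # prompt for the first window
    "condition_on_previous_text": True,
    "temperature_increment_on_fallback": 0.2,
    "compression_ratio_threshold": 2.4,
    "logprob_threshold": -1,
    "no_speech_threshold": 0.6,
}
```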
Output
{
  "task_id": "d9zzvghifs95q8fkfd",
  "user_id": 1,
  "version": "8099696689d249cf8b122d833c36ac3f75505c666a395ca40ef26f68e7d3d16e",
  "error": null,
  "total_time": 6.41,
  "predict_time": 6.41,
  "logs": null,
  "output": [
    "{\"detected_language\":\"chinese\",\"segments\":[{\"avg_logprob\":-0.13921591999766592,\"compression_ratio\":1.1081081081081081,\"end\":10.1,\"id\":0,\"no_speech_prob\":0.08148325234651566,\"seek\":0,\"start\":0,\"temperature\":0,\"text\":\"宝贝,欢迎收听凯书365页,也感谢你关注凯书讲故事的微信公众账号和APP软件。\",\"tokens\":[50365,2415,251,18464,251,11,28566,17699,18681,31022,6336,107,2930,99,11309,20,10178,113,11,6404,9709,11340,2166,28053,26432,6336,107,2930,99,39255,43045,6973,1546,39152,17665,13545,7384,245,18464,99,26987,12565,8749,17819,107,20485,1543,50870]},{\"avg_logprob\":-0.13921591999766592,\"compression_ratio\":1.1081081081081081,\"end\":22.400000000000002,\"id\":1,\"no_speech_prob\":0.08148325234651566,\"seek\":0,\"start\":11.040000000000001,\"temperature\":0,\"text\":\"今天凯书要给你讲一个成语故事,叫做《悲公蛇影》,这个故事发生在东汉年间。\",\"tokens\":[50917,12074,6336,107,2930,99,4275,23197,2166,39255,20182,11336,5233,255,43045,6973,11,19855,10907,9806,14696,110,13545,26145,229,16820,9782,11,15368,43045,6973,28926,8244,3581,38409,12800,231,5157,31685,1543,51485]},{\"avg_logprob\":-0.10092328843616304,\"compression_ratio\":0.8070175438596491,\"end\":29.06,\"id\":2,\"no_speech_prob\":0.13098260760307312,\"seek\":2240,\"start\":22.4,\"temperature\":0,\"text\":\"话说这是一年盛夏,天气燥热得很。\",\"tokens\":[50365,21596,8090,27455,2257,5157,5419,249,42708,11,6135,42204,24184,98,23661,255,5916,4563,1543,50698]}],\"transcription\":\"宝贝,欢迎收听凯书365页,也感谢你关注凯书讲故事的微信公众账号和APP软件。今天凯书要给你讲一个成语故事,叫做《悲公蛇影》,这个故事发生在东汉年间。话说这是一年盛夏,天气燥热得很。\",\"translation\":null}"
  ],
  "status": "succeeded",
  "create_at": 1746492954,
  "completed_at": 1746493015,
  "input": {
    "seed": 0,
    "audio": "https://vmodel.ai/data/dev/model/vmodel/whisper/007_output_01.mp3",
    "model": "large-v3",
    "transcription": "plain text",
    "translate": false,
    "language": "auto",
    "temperature": 0,
    "suppress_tokens": "-1",
    "initial_prompt": "",
    "condition_on_previous_text": true,
    "temperature_increment_on_fallback": 0.2,
    "compression_ratio_threshold": 2.4,
    "logprob_threshold": -1,
    "no_speech_threshold": 0.6
  }
}
Generated in: 6.41 seconds
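Note that the `output` field is a list whose elements are themselves JSON-encoded strings, so the result must be decoded a second time before the transcription fields can be read. A minimal sketch, using a trimmed stand-in for the response above:

```python
import json

# Trimmed stand-in for the response shown above: "output" holds a
# JSON-encoded string, not an already-parsed object.
response = {
    "status": "succeeded",
    "output": [
        '{"detected_language": "chinese", "segments": [], '
        '"transcription": "...", "translation": null}'
    ],
}

# Decode the inner JSON string to reach the actual transcription fields.
result = json.loads(response["output"][0])
print(result["detected_language"])  # chinese
```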
Pricing
Model pricing for vmodel/whisper. Looking for volume pricing? Get in touch.
When using this model:
$0.0100 per use, or 100 uses for $1