vmodel/talking-photo-sonic
Sonic (CVPR 2025) generates photorealistic talking-face animation from a single static portrait and a corresponding audio track.
Output: $0.08 / second or 12 seconds / $1
Input
image * image
Input Image
audio * audio
Input audio file (WAV, MP3, etc.) for the voice.
dynamic_scale float
Controls movement intensity; higher values produce more pronounced motion.
min_resolution int
Minimum image resolution for processing. Lower values use less memory but may reduce quality.
inference_steps int
Number of diffusion steps. Higher values may improve quality but take longer.
keep_resolution boolean
If true, output video matches the original image resolution. Otherwise uses the min_resolution after cropping.
seed int
Random seed for reproducible results. Leave blank for a random seed.
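Programmatic use typically means sending these parameters as a JSON payload. A minimal sketch of assembling one, using the parameter names and defaults above (the helper itself is illustrative, not part of the documented API):

```python
import json

def build_payload(image_url, audio_url, dynamic_scale=1.0,
                  min_resolution=512, inference_steps=25,
                  keep_resolution=False, seed=None):
    """Assemble the model input using the documented parameter names."""
    payload = {
        "image": image_url,
        "audio": audio_url,
        "dynamic_scale": float(dynamic_scale),
        "min_resolution": int(min_resolution),
        "inference_steps": int(inference_steps),
        "keep_resolution": bool(keep_resolution),
    }
    if seed is not None:          # omit for a random seed, as documented
        payload["seed"] = int(seed)
    return payload

payload = build_payload(
    "https://example.com/portrait.png",   # placeholder URLs
    "https://example.com/voice.wav",
)
print(json.dumps(payload, indent=2))
```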
Output
{
  "task_id": "d9zzvghifs95q8fkfd",
  "user_id": 1,
  "version": "cf50350d63bbe4178e97bd144aaae86167255ac1d33b09a0662fb9c195ad6f55",
  "error": null,
  "total_time": 300,
  "predict_time": 300,
  "logs": null,
  "output": [
    "https://vmodel.ai/data/model/vmodel/talking-photo-sonic/output.mp4"
  ],
  "status": "succeeded",
  "create_at": 1746492954,
  "completed_at": 1746493015,
  "input": {
    "audio": "https://raw.githubusercontent.com/jixiaozhong/Sonic/main/examples/wav/talk_female_english_10s.MP3",
    "image": "https://raw.githubusercontent.com/jixiaozhong/Sonic/main/examples/image/anime1.png",
    "dynamic_scale": 1,
    "min_resolution": 512,
    "inference_steps": 25,
    "keep_resolution": false
  }
}
Generated in: 300 seconds
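A completed response like the one above can be consumed in a few lines: check the status, pull out the video URL, and compute wall-clock time from the timestamps. Field names are taken from the example response; the error handling is an assumption:

```python
def extract_result(response: dict) -> tuple[str, int]:
    """Return (video_url, wall-clock seconds) from a finished task."""
    if response.get("error"):
        raise RuntimeError(f"task failed: {response['error']}")
    if response.get("status") != "succeeded":
        raise RuntimeError(f"task not finished: {response.get('status')}")
    url = response["output"][0]            # first output video
    elapsed = response["completed_at"] - response["create_at"]
    return url, elapsed

example = {
    "error": None,
    "status": "succeeded",
    "output": ["https://vmodel.ai/data/model/vmodel/talking-photo-sonic/output.mp4"],
    "create_at": 1746492954,
    "completed_at": 1746493015,
}
url, elapsed = extract_result(example)
print(url, elapsed)
```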
Pricing
This model is priced by the length of the output video.
Output: $0.08 / second, or about 12 seconds / $1
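At this rate, cost scales linearly with output length; a quick sanity check:

```python
PRICE_PER_SECOND = 0.08  # USD per second of output video

def video_cost(seconds: float) -> float:
    """Cost of a generated clip, rounded to whole cents."""
    return round(seconds * PRICE_PER_SECOND, 2)

print(video_cost(10))   # a 10-second clip costs $0.80
print(video_cost(60))   # one minute costs $4.80
```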
Readme

About the Sonic Model

Sonic is an innovative audio-driven portrait animation model that goes beyond traditional lip-sync techniques. By leveraging global audio features—such as tone, rhythm, and emotional cues—it generates natural and expressive facial animations, including subtle head movements. This helps avoid the stiff, “puppet-like” appearance often seen in older methods, resulting in more lifelike and engaging visuals.

✨ Key Features

  • Expressive Facial & Head Movements: Captures tone and rhythm from audio to drive nuanced facial expressions and subtle head motion.
  • Accurate Lip Sync: Precisely matches lip movements with spoken audio.
  • Single Image Input: Requires only one frontal or near-frontal portrait image for animation.
  • Supports Common Audio Formats: Compatible with WAV, MP3, and other standard formats.
  • Robust Face Handling: Uses YOLOv5 for face detection and cropping. Falls back to the original image if no face is detected.
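The fallback behavior in the last bullet can be sketched independently of the actual detector. Below, `detect_faces` is a hypothetical stand-in for the YOLOv5 call (assumed to return a list of (x, y, w, h) boxes); the point is the crop-or-fallback logic, not the detection itself:

```python
def crop_to_face(image, detect_faces, margin=0.2):
    """Crop `image` (a list of pixel rows) to the first detected face,
    expanded by `margin`; return it unchanged if no face is found,
    mirroring the model's fallback behavior."""
    boxes = detect_faces(image)
    if not boxes:
        return image                      # fallback: use the full image
    x, y, w, h = boxes[0]
    pad_w, pad_h = int(w * margin), int(h * margin)
    ih, iw = len(image), len(image[0])
    x0, y0 = max(0, x - pad_w), max(0, y - pad_h)
    x1, y1 = min(iw, x + w + pad_w), min(ih, y + h + pad_h)
    return [row[x0:x1] for row in image[y0:y1]]

img = [[0] * 100 for _ in range(100)]     # dummy 100x100 image
print(len(crop_to_face(img, lambda im: [])))          # 100: no face, unchanged
crop = crop_to_face(img, lambda im: [(40, 40, 20, 20)])
print(len(crop), len(crop[0]))            # 28 28: face box plus 20% margin
```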

🔬 Technical Highlights

Sonic integrates several advanced technologies:

  • Audio Feature Extraction: Likely uses models like Whisper to encode rich tonal and rhythmic features from audio.
  • Video Diffusion Generation: Employs techniques similar to Stable Video Diffusion to produce smooth, coherent video sequences.
  • Temporal Consistency: Integrates methods like RIFE for frame interpolation to enhance visual smoothness.
  • Accurate Face Localization: Powered by YOLOv5 to ensure focus on facial regions.
  • Global Audio Perception: Core innovation enabling holistic expression mapping driven by the overall audio context.
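Frame interpolation itself is easy to illustrate. RIFE is a learned optical-flow model, but the simplest stand-in, a per-pixel linear blend of neighboring frames, shows what "inserting intermediate frames" means (RIFE instead warps pixels along estimated motion, which handles movement far better):

```python
def blend(frame_a, frame_b, t=0.5):
    """Naive intermediate frame: per-pixel linear blend of two frames."""
    return [[(1 - t) * a + t * b for a, b in zip(ra, rb)]
            for ra, rb in zip(frame_a, frame_b)]

def interpolate(frames):
    """Double the frame rate by inserting a blended frame between neighbors."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out += [a, blend(a, b)]
    out.append(frames[-1])
    return out

clip = [[[0.0]], [[2.0]], [[4.0]]]        # three one-pixel frames
print(interpolate(clip))                  # → [[[0.0]], [[1.0]], [[2.0]], [[3.0]], [[4.0]]]
```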

💡 Use Cases

  • Animating avatars for virtual assistants, digital humans, and game characters
  • Creating short video content from voice recordings and static images
  • Developing accessibility tools with expressive visual feedback
  • Enabling creative and social media content generation

⚠️ Limitations

  • Best results are achieved with high-resolution, clear, front-facing portrait images
  • Designed for facial and subtle head animations only—not full-body motion
  • Lip-sync accuracy may be affected by audio quality (e.g., background noise, unclear speech)
  • Expression mapping is learned and may not perfectly capture every nuance intended by the speaker