vmodel/talking-photo-sonic
Sonic (CVPR 2025) generates photorealistic talking-face animation from a single static portrait and a corresponding audio track.
Output: $0.08 / second or 12 seconds / $1
Input
image * image
Input Image
audio * audio
Input audio file (WAV, MP3, etc.) for the voice.
dynamic_scale float
Controls movement intensity; higher values produce more pronounced motion.
min_resolution int
Minimum image resolution for processing. Lower values use less memory but may reduce quality.
inference_steps int
Number of diffusion steps. Higher values may improve quality but take longer.
keep_resolution boolean
If true, output video matches the original image resolution. Otherwise uses the min_resolution after cropping.
seed int
Random seed for reproducible results. Leave blank for a random seed.
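Programmatic use typically means sending these parameters as a JSON payload. A minimal sketch of assembling one, using the parameter names and defaults above (the helper itself is illustrative, not part of the documented API):

```python
import json

def build_payload(image_url, audio_url, dynamic_scale=1.0,
                  min_resolution=512, inference_steps=25,
                  keep_resolution=False, seed=None):
    """Assemble the model input using the documented parameter names."""
    payload = {
        "image": image_url,
        "audio": audio_url,
        "dynamic_scale": float(dynamic_scale),
        "min_resolution": int(min_resolution),
        "inference_steps": int(inference_steps),
        "keep_resolution": bool(keep_resolution),
    }
    if seed is not None:          # omit for a random seed, as documented
        payload["seed"] = int(seed)
    return payload

payload = build_payload(
    "https://example.com/portrait.png",   # placeholder URLs
    "https://example.com/voice.wav",
)
print(json.dumps(payload, indent=2))
```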
Output
{
  "task_id": "d9zzvghifs95q8fkfd",
  "user_id": 1,
  "version": "cf50350d63bbe4178e97bd144aaae86167255ac1d33b09a0662fb9c195ad6f55",
  "error": null,
  "total_time": 300,
  "predict_time": 300,
  "logs": null,
  "output": [
    "https://vmodel.ai/data/model/vmodel/talking-photo-sonic/output.mp4"
  ],
  "status": "succeeded",
  "create_at": 1746492954,
  "completed_at": 1746493015,
  "input": {
    "audio": "https://raw.githubusercontent.com/jixiaozhong/Sonic/main/examples/wav/talk_female_english_10s.MP3",
    "image": "https://raw.githubusercontent.com/jixiaozhong/Sonic/main/examples/image/anime1.png",
    "dynamic_scale": 1,
    "min_resolution": 512,
    "inference_steps": 25,
    "keep_resolution": false
  }
}
Generated in: 300 seconds
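A completed response like the one above can be consumed in a few lines: check the status, pull out the video URL, and compute wall-clock time from the timestamps. Field names are taken from the example response; the error handling is an assumption:

```python
def extract_result(response: dict) -> tuple[str, int]:
    """Return (video_url, wall-clock seconds) from a finished task."""
    if response.get("error"):
        raise RuntimeError(f"task failed: {response['error']}")
    if response.get("status") != "succeeded":
        raise RuntimeError(f"task not finished: {response.get('status')}")
    url = response["output"][0]            # first output video
    elapsed = response["completed_at"] - response["create_at"]
    return url, elapsed

example = {
    "error": None,
    "status": "succeeded",
    "output": ["https://vmodel.ai/data/model/vmodel/talking-photo-sonic/output.mp4"],
    "create_at": 1746492954,
    "completed_at": 1746493015,
}
url, elapsed = extract_result(example)
print(url, elapsed)
```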
Pricing
This model is priced by the length of the output video.
Output: $0.08 / second, or about 12 seconds / $1
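At this rate, cost scales linearly with output length; a quick sanity check:

```python
PRICE_PER_SECOND = 0.08  # USD per second of output video

def video_cost(seconds: float) -> float:
    """Cost of a generated clip, rounded to whole cents."""
    return round(seconds * PRICE_PER_SECOND, 2)

print(video_cost(10))   # a 10-second clip costs $0.80
print(video_cost(60))   # one minute costs $4.80
```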
Readme

About the Sonic Model

Sonic is an innovative audio-driven portrait animation model that goes beyond traditional lip-sync techniques. By leveraging global audio features—such as tone, rhythm, and emotional cues—it generates natural and expressive facial animations, including subtle head movements. This helps avoid the stiff, “puppet-like” appearance often seen in older methods, resulting in more lifelike and engaging visuals.

✨ Key Features

  • Expressive Facial & Head Movements: Captures tone and rhythm from audio to drive nuanced facial expressions and subtle head motion.
  • Accurate Lip Sync: Precisely matches lip movements with spoken audio.
  • Single Image Input: Requires only one frontal or near-frontal portrait image for animation.
  • Supports Common Audio Formats: Compatible with WAV, MP3, and other standard formats.
  • Robust Face Handling: Uses YOLOv5 for face detection and cropping. Falls back to the original image if no face is detected.
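The fallback behavior in the last bullet can be sketched independently of the actual detector. Below, `detect_faces` is a hypothetical stand-in for the YOLOv5 call (assumed to return a list of (x, y, w, h) boxes); the point is the crop-or-fallback logic, not the detection itself:

```python
def crop_to_face(image, detect_faces, margin=0.2):
    """Crop `image` (a list of pixel rows) to the first detected face,
    expanded by `margin`; return it unchanged if no face is found,
    mirroring the model's fallback behavior."""
    boxes = detect_faces(image)
    if not boxes:
        return image                      # fallback: use the full image
    x, y, w, h = boxes[0]
    pad_w, pad_h = int(w * margin), int(h * margin)
    ih, iw = len(image), len(image[0])
    x0, y0 = max(0, x - pad_w), max(0, y - pad_h)
    x1, y1 = min(iw, x + w + pad_w), min(ih, y + h + pad_h)
    return [row[x0:x1] for row in image[y0:y1]]

img = [[0] * 100 for _ in range(100)]     # dummy 100x100 image
print(len(crop_to_face(img, lambda im: [])))          # 100: no face, unchanged
crop = crop_to_face(img, lambda im: [(40, 40, 20, 20)])
print(len(crop), len(crop[0]))            # 28 28: face box plus 20% margin
```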

🔬 Technical Highlights

Sonic integrates several advanced technologies:

  • Audio Feature Extraction: Likely uses models like Whisper to encode rich tonal and rhythmic features from audio.
  • Video Diffusion Generation: Employs techniques similar to Stable Video Diffusion to produce smooth, coherent video sequences.
  • Temporal Consistency: Integrates methods like RIFE for frame interpolation to enhance visual smoothness.
  • Accurate Face Localization: Powered by YOLOv5 to ensure focus on facial regions.
  • Global Audio Perception: Core innovation enabling holistic expression mapping driven by the overall audio context.
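Frame interpolation itself is easy to illustrate. RIFE is a learned optical-flow model, but the simplest stand-in, a per-pixel linear blend of neighboring frames, shows what "inserting intermediate frames" means (RIFE instead warps pixels along estimated motion, which handles movement far better):

```python
def blend(frame_a, frame_b, t=0.5):
    """Naive intermediate frame: per-pixel linear blend of two frames."""
    return [[(1 - t) * a + t * b for a, b in zip(ra, rb)]
            for ra, rb in zip(frame_a, frame_b)]

def interpolate(frames):
    """Double the frame rate by inserting a blended frame between neighbors."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out += [a, blend(a, b)]
    out.append(frames[-1])
    return out

clip = [[[0.0]], [[2.0]], [[4.0]]]        # three one-pixel frames
print(interpolate(clip))                  # → [[[0.0]], [[1.0]], [[2.0]], [[3.0]], [[4.0]]]
```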

💡 Use Cases

  • Animating avatars for virtual assistants, digital humans, and game characters
  • Creating short video content from voice recordings and static images
  • Developing accessibility tools with expressive visual feedback
  • Enabling creative and social media content generation

⚠️ Limitations

  • Best results are achieved with high-resolution, clear, front-facing portrait images
  • Designed for facial and subtle head animations only—not full-body motion
  • Lip-sync accuracy may be affected by audio quality (e.g., background noise, unclear speech)
  • Expression mapping is learned and may not perfectly capture every nuance intended by the speaker