
There are many talking lipsync AI tools out there. The inputs are usually text and an image of a person. The results are almost indistinguishable from a non-AI talking video. But when it comes to lipsync videos involving singing, that’s a whole different story. Generating realistic singing lipsync videos is apparently very challenging. Tools like Kling AI and Runway ML, despite being very popular tools for video generation, do a horrible job at this. After trying a number of tools, the two best ones I’ve found are TopMediAI and HeyGen. In this post, I’ll share my experience using them.
UPDATE 12/19/2025: Longcat Avatar is a new option that is worth trying and comparing against.
UPDATE 12/8/2025: There’s a new singing lipsync generator called WaveSpeed MultiTalk (WAN 2.1). Preliminary testing indicates that, with respect to video quality, MultiTalk is better than TopMediAI but not as good as HeyGen. With respect to lipsync, MultiTalk is just as good as TopMediAI and better than HeyGen.
TopMediAI Singing Photo Maker
This tool does a decent job at creating singing lipsync videos, and the interface is very simple and intuitive. Though it’s designed for singing, it’s far from perfect.
Inputs
- upload an audio file (mp3) between 2 and 30 seconds
- upload an image of the character you want to sing
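
If you prepare your clips with a script, it may help to check the clip length before uploading. Here is a minimal Python sketch of that check. The function name and the idea of passing the duration in as a number are my own assumptions (you would read the duration from your audio editor or file properties); only the 2 and 30 second limits come from the list above.

```python
def clip_length_ok(duration_seconds: float,
                   min_seconds: float = 2.0,
                   max_seconds: float = 30.0) -> bool:
    """Return True if the clip duration is within the accepted range (inclusive).

    The defaults are the TopMediAI limits listed above (2 to 30 seconds).
    The function name and signature are hypothetical, for illustration only.
    """
    return min_seconds <= duration_seconds <= max_seconds


if __name__ == "__main__":
    print(clip_length_ok(12.5))  # a 12.5-second clip is within 2-30 seconds
    print(clip_length_ok(45.0))  # a 45-second clip is too long for TopMediAI
```

The same function can be reused for HeyGen by passing its limits, mentioned later in this post (1 second to 3 minutes): `clip_length_ok(45.0, min_seconds=1, max_seconds=180)`.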

When generating a video using TopMediAI, generation will sometimes fail repeatedly. In my experience, you may have to retry 3-5 times before generation succeeds. It’s annoying, but it’ll eventually work.
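
If you ever automate this, the "keep trying" approach can be sketched as a small retry loop. This is only an illustration under assumptions: I did all of my generations by hand in the web interface, and `generate` below is a hypothetical function that you would supply yourself to start one generation attempt and raise an exception on failure. It is not a TopMediAI API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def generate_with_retries(generate: Callable[[], T],
                          max_attempts: int = 5,
                          delay_seconds: float = 0.0) -> T:
    """Call `generate` until it succeeds or `max_attempts` is reached.

    `generate` is a hypothetical callable that performs one generation
    attempt and raises an exception when the attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return generate()
        except Exception as error:
            last_error = error
            print(f"Attempt {attempt} of {max_attempts} failed: {error}")
            if attempt < max_attempts:
                # Optional pause before trying again.
                time.sleep(delay_seconds)
    raise RuntimeError(
        f"Generation failed after {max_attempts} attempts"
    ) from last_error
```

The default of 5 attempts matches the upper end of the 3-5 tries I typically needed. The function gives up with an error after that rather than looping forever.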
HeyGen
This tool was designed for creating talking lipsync videos, not for singing. Nevertheless, its most advanced motion engine (Avatar IV) does a pretty good job at generating a singing lipsync video if you choose the “Quality” mode with a “Custom motion” value of “singing”. If you use the “Avatar Unlimited” engine, the results are just not good enough, in my opinion.
UPDATE 12/8/2025: If you use the “Faster” generation mode, the quality appears to be just as good as the “Quality” mode, so just choose that mode since it costs half as much as the “Quality” mode.

The process to create a lipsync video using HeyGen is more complex. Here are the steps:
- Click “Avatars” > “Create New” > “Start from a photo”
- Upload a photo and wait for it to be processed

- Choose to create a new avatar or add the photo as a new “look” of an existing avatar. (One avatar can have multiple “looks”)
- Click “Create with AI Studio”

- Click “Audio” > “Upload Audio”, then upload your audio clip. You can upload a clip anywhere between 1 second and 3 minutes long.

You can also choose a previously uploaded audio clip.

- Play and confirm the uploaded/selected audio.

- HeyGen will attempt to transcribe the audio. If transcription fails, you won’t be able to proceed. In my experience, if it fails, it’s usually because the audio clip is too short. When I upload a longer clip, HeyGen can usually transcribe it. Note that the transcription can be wrong. This doesn’t appear to matter, as the video generation appears to be based on sound rather than words.

- Click “Generate”.
Comparing HeyGen to TopMediAI
Body movements
Neither TopMediAI nor HeyGen will make your character dance, but they will animate your character’s body to some extent. This is good, because older technologies literally only animated the lips or face and left everything else frozen/static. I feel that TopMediAI generates stronger body and lip movements, which makes the results look more realistic from that perspective.
Lipsync accuracy
When uploading an audio clip, it’s better to isolate the vocals from the backing track to prevent TopMediAI and HeyGen from getting confused. Nevertheless, even when you upload only the vocal track of a song, both AI tools occasionally produce inaccurate results. For example, instead of lip movements for the word “hati”, TopMediAI made lip movements as if to sing the word “hapi”; it wasn’t able to detect the difference between the “t” and “p” sounds. HeyGen seems to do a better job at lipsync accuracy.
Sustained vocal sounds
TopMediAI animates both the subject’s body and their lips to try to match the sounds in the audio file. This is particularly necessary for sustained vocal sounds, like in the following example.
Using the same inputs and HeyGen’s most advanced model (Avatar IV in “Quality” mode with a “Custom Motion” value of “Singing”), you can see below that HeyGen failed.
Video picture quality
With TopMediAI, if you upload an image of a zoomed-out character, even if it’s a hi-res image, the tool will have difficulty detecting the facial features, and the resulting video will be blurry with lots of artifacts. For that reason, I only upload images containing close-up shots of the character from the waist up. However, even then, the picture quality of the generated lipsync video deteriorates, sometimes significantly. For example, here’s the source image I uploaded to TopMediAI:

And here’s a frame from the generated video:

That’s a big difference.
HeyGen, on the other hand, does a much better job at preserving the picture quality of the source. For example, compare the source and generated (screenshot) images below.


Teeth
TopMediAI can’t seem to produce consistent and natural-looking teeth. Sometimes, the results are acceptable, but other times, they are not. Compare the following.


HeyGen, on the other hand, does a very good job at showing natural, almost perfect teeth, as in this example:

Output resolution
With HeyGen, you can export videos up to 4K quality. With TopMediAI, there are no resolution options.

Recommendations
I would definitely use HeyGen’s Avatar IV with the “Quality” mode first to generate singing lipsync videos. If the results don’t look good, then I’d use TopMediAI as a fallback.