How to Create a Cover Song in Suno AI

Let’s say you have the rights to a song, e.g, a song that’s in the public domain, and you want to create a cover for it by only replacing the lyrics. Here’s how you can do it using ChatGPT and Suno AI.

Upload the original song to Suno

For this example, I uploaded this song.

Suno will add your uploaded song to your workspace, as shown below. When you click on the song’s title, you’ll see an auto-generated style description of the song and the lyrics.

Song Style

A French pop song with a moderate tempo and a romantic, dreamy atmosphere, The instrumentation features a prominent acoustic guitar playing arpeggiated chords, a bass guitar providing a smooth, walking bass line, and a drum kit with a soft, brushed snare sound, Synthesizers contribute to the ethereal quality with pad sounds and occasional melodic lines, The female lead vocalist sings with a soft, breathy tone, employing a gentle vibrato, The song structure follows a verse-chorus pattern with a bridge, The chord progression is primarily diatonic, creating a sense of warmth and familiarity, Reverb is applied generously to the vocals and some instrumental elements, enhancing the dreamy quality, The overall mix is balanced, with the vocals sitting clearly in the foreground

Song Lyrics

[Intro]
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)

[Verse 1]
Quand tu me souris
Mon cœur s’envole
Tout devient doux
Tout prend son rôle
Ton nom résonne
Comme une chanson
Dans mes rêves
Tu es ma raison

[Chorus]
Je me sens au ciel de t’aimer
Comme un ange qui vient d’naître
Ton amour me fait trembler
Sous la lune, je veux renaître

[Verse 2]
Tes yeux brillent comme l’été
Chaque nuit, je veux t’aimer
Ta voix douce me fait rêver
Je t’appelle sans m’arrêter

[Chorus]
Je me sens au ciel de t’aimer
Quand tes mains touchent ma peau
Le temps s’arrête enchanté
Dans ton cœur, je trouve mon écho

[Bridge]
Même si le jour se lève
Et que tout s’enfuit
Ton amour reste mon rêve
Mon paradis, c’est toi mon fruit

[Chorus]
Je me sens au ciel de t’aimer
Chaque instant, je veux y croire
Je me sens

[Outro]
Au ciel de t’aimer
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)

Generate new lyrics

To create new lyrics, the easiest way is to ask ChatGPT to do it for you. For example, here’s a prompt I gave ChatGPT to generate new lyrics that meet my specific criteria:

I want to create a cover of a French song. The lyrics are below. Give me English lyrics that have a similar theme and can be used as a drop-in replacement for the French lyrics so I can use the Suno cover feature. If the French song has a rhyming pattern, the new lyrics should rhyme as well. If possible, try to match the syllable count, but prioritize the flow and meaning of the lyrics over the syllable count.

The lyrics below were auto-generated by Suno.

[Intro]
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)
(Paru paru pa)

[Verse 1]
Quand tu me souris
Mon cœur s’envole
Tout devient doux
Tout prend son rôle
…

ChatGPT will then respond with the lyrics. Here’s what it gave me:

[Intro]

(Para para pa)
(Para para pa)
(Para para pa)
(Para para pa)
(Para para pa)
(Para para pa)
(Para para pa)
(Para para pa)

VERSE 1

When you smile at me,
My heart goes free,
Everything glows,
Warm like it knows,
Your name rings on,
Soft as a song,
In dreams each night,
You are my light.

[Chorus]

Your love lifts me higher each day,
Like an angel learning to fly,
Every touch sends my worries away,
Under moonlight, you open the sky.

[Verse 2]

Your eyes shine bright, like summer air,
Every night I find you there,
Your soft voice brings me into dream,
Calling you flows like a stream.

[Chorus]

Your love lifts me higher each day,
When your hands are holding mine,
Time stands still in your gentle way,
In your heart, all my echoes align.

[Bridge]

Even when the morning shines,
And the world moves on,
Still your love stays in my mind,
You’re the dream I lean upon.

[Chorus]

Your love lifts me higher each day,
Every moment feels so true,
I feel you—
So high in love with you.

[Outro]

Floating through the sky
(Para para pa)
(Para para pa)
(Para para pa)
(Para para pa)
(Para para pa)
(Para para pa)
(Para para pa)
(Para para pa)

Create cover song

Now that you have new lyrics, go back to Suno and click on the 3 dots next to the original song title, the click “Remix/Edit” > “Cover”

In the left pane, Suno will load the original song and lyrics. Replace the original lyrics with your new lyrics. In the “Style” field, paste the Suno-generated style description. Suno will also set some advanced options, like “weirdness”, “style influence”, and “audio influence”. You can keep the defaults.

Click “Create”. Suno will create 2 cover songs, as shown below.

Edit the cover song

After listening to the two cover songs, I like the second one more, but one section didn’t sound right. Some of the lines in verse 2 sounded rushed. To fix this, edit the song by clicking on “Open in Editor”.

With the editor open, you’ll see the song’s waveform, color-coded by section. Click on the section containing the lyrics you wan to edit. In this example, that’s the pink section shown below. When you click on it, the lyrics for the section will be selected in the lyrics box on the left. You can then type in revised lyrics in the “new lyrics” box below it. In this case, I made some of the lyrics shorter (fewer words).

Click the “Replace” button. Suno will generate two alternate versions of that section with the modified lyrics you provided. Click the play button beside each one to preview the alternate versions. If you don’t like either one, click “Regenerate” to generate more versions. When you like a version, click “Commit” to replace the section with the new section.

When you’re done editing, click “Save as new song”. The edited song will appear in your workspace.

You can then download the song.

Here’s the cover song in English. As you’ll hear, the backing instrumentals sound almost identical to the original French song, but the lyrics are new.

Cheaply Create High-Quality Images Inspired by Existing Images Using AI

Recently, I needed some high-quality Mediterranean images. I tried searching stock photo libraries, but they were expensive, it took too long, and the images weren’t that good or what I was really looking for. I found several videos on YouTube that had the type of images I wanted, but I didn’t want to copy them exactly, so I used AI to create new images inspired by them. Here’s how I created them.

AI Model: SeeDream 4.0

Prompt: Create a Greek home using the same colors, lighting, and elements from the reference image, but it should look different from the reference image.

Reference Image:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image using the same colors, lighting, and elements from the reference image, but it should look different from the reference image. The perspective should be 45 degrees from the perspective of the reference image, facing the sea.

Reference Image:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same composition and view as the first one, but use colors and materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same composition and view as the first one, but use colors and materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same composition and view as the first one, but use colors and materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same, layout, composition and view as the first one, but use colors and materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same, layout, composition and view as the first one, but use colors and materials from the second one. Keep the blue and white tile in the first image as is.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same, layout, composition, elements and view/framing as the first one, but use colors and building materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same, layout, composition, elements and view/framing as the first one, but use colors and building materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same, layout, composition, elements and view/framing as the first one, but use colors and building materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same, layout, composition, elements and view/framing as the first one, but use colors and building materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same, layout, composition, elements and view/framing as the first one, but use colors and building materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same, layout, composition, elements and view/framing as the first one, but use colors and building materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same, layout, composition, elements and view/framing as the first one, but use colors and building materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same, layout, composition, elements and view/framing as the first one, but use colors and building materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same, layout, composition, elements and view/framing as the first one, but use colors and building materials from the second one.

Reference Images:

AI-Generated Image:

AI Model: SeeDream 4.0

Prompt: Create an image with the same, layout, composition, elements and view/framing as the first one, but use colors and building materials from the second one.

Reference Images:

AI-Generated Image:

I ended up using the images to create 5-second background videos using Kling AI for this music video:

Comparing TopMediAI to HeyGen for Singing Lipsync AI Video Generation

There are many talking lipsync AI tools out there. The inputs are usually text and an image of a person. The results are almost indistinguishable from a non-AI talking video. But when it comes to lipsync videos involving singing, that’s a whole different story. Generating realistic singing lipsync videos is apparently very challenging. Tools like Kling AI and Runway ML, despite being very popular tools for video generation, do a horrible job at this. After trying a number of tools, the two best ones I’ve found are TopMediAI and HeyGen. In this post, I’ll share my experience using them.

TopMediAI Singing Photo Maker

Website

This tool does a decent job at creating singing lipsync videos, and the interface is very simple and intuitive. Though it’s designed for singing, it’s far from perfect.

Inputs

upload an audio file (mp3) between 2 and 30 seconds
upload an image of the character you want to sing

When generating a video using TopMediAI, sometimes, generation will fail repeatedly. From my experience, you have to keep trying 3-5 times until generation succeeds. It’s annoying, but it’ll eventually work.

HeyGen

Website

This tool was designed for creating talking lipsync videos, not for singing. Nevertheless, it’s most advanced motion engine (Avatar IV) does a pretty good job a generating a singing lipsync video if you choose the “Quality” mode with a “Custom motion” value of “singing”. If you use the “Avatar Unlimited” engine, the results are just not good enough, in my opinion.

The process to create a lipsync video using HeyGen is more complex. Here are the steps:

Click “Avatars” > “Create New” > “Start from a photo” >
Upload a photo and wait for it to be processed

Choose to create a new avatar or add the photo as a new “look” of an existing avatar. (One avatar can have multiple “looks”)
Click “Create with AI Studio”

Click “Audio” > “Upload Audio” , then upload your audio clip. You can upload a clip anywhere between 1 second and 3 minutes.

You can also choose from a previously uploaded audio.

Play and confirm the uploaded/selected audio.

HeyGen will attempt to transcribe the audio. If transcription fails, you won’t be able to proceed. In my experience, if it fails, it’s usually because the audio clip is too short. When I upload a longer clip, it usually can transcribe it. Note that the transcription can be wrong. This doesn’t appear to matter, as the video generation appears to be based on sound rather than words.

Click “Generate”.

Comparing HeyGen to TopMediAI

Body movements

Neither TopMediAI nor HeyGen will make your character dance, but they will animate your character’s body to some extent. This is good, because older technologies literally only animated the lips or face and left everything else frozen/static. I feel that TopMediAI generates stronger body and lip movements, which makes the results look more realistic from that perspective.

Lipsync accuracy

When uploading a audio clip, it’s better to isolate the vocals from the backing track to prevent TopMediAI and HeyGen from getting confused. Neverthless, even when you upload the vocal track of a song, both AI tools occasionally produce inaccurate results, e.g., instead of lip movements to sing the word “hati”, TopMediAI made the lip movements as if to sing the word “hapi”; it wasn’t able to detect the difference between the “t” and “p” sounds. HeyGen seems to do a better job at lipsync accuracy.

Sustained vocal sounds

TopMediAI animates both the subject’s body and their lips to try to match the sounds in the audio file. This is particularly necessary for sustained vocal sounds, like in the following example.

Using the same inputs, and using HeyGen’s most advanced model (Avatar IV in “Quality” mode with a “Custom Motion” value of “Singing”, you can see below that HeyGen failed.

Video picture quality

With TopMediAI, if you upload an image of a zoomed-out character, even if it’s a hi-res image, the tool will have difficulty detecting the facial features, and the resulting video will be blurry with lots of artifacts. For that reason, I only upload images containing close-up shots of the character from the waist up. However, even then, the picture quality of the generated lipsync video deteriorates, sometimes significantly. For example, here’s the source image I uploaded to TopMediAI:

And here’s a frame from the generated video:

That’s a big difference.

HeyGen, on the other hand, does a much better job at preserving picture quality of the source. For example, compare the source and generated (screenshot) images below.

Teeth

TopMediAI can’t seem to produce consistent and natural-looking teeth. Sometimes, the results are acceptable, but other times, they are not. Compare the following.

HeyGen, on the other hand, does a very good job and showing natural, and almost perfect, teeth, as in this example:

Output resolution

With HeyGen, you can export videos up to 4K quality. With TopMediAI, there are no resolution options.

Recommendations

I would definitely use HeyGen’s Avatar IV with the “Quality” mode first to generating singing lipsync videos. If the results don’t look good, then I’d use TopMediaAI as a fallback.

Camera Shots, Angles, and Movements For Generating AI Images and Videos

When generating images and videos using AI, you need to include in your prompt how the camera takes the picture or video. Following are common camera shots and movements you can reference when creating your prompts. The example images below were all generated using Nano Banana. The videos were generated using Kling AI.

Shot Types (Distance and Framing)

Shot Type	Description
Extreme Close-Up (ECU)	Focuses tightly on one detail (eyes, mouth, hand). Great for emotional intensity.
Close-Up (CU)	Shows the subject’s head and shoulders — captures facial emotion clearly.
Medium Shot (MS)	Shows the subject from waist up — good balance between subject and background.
Full Shot (FS)	Shows the entire body of the subject within the frame.
Wide Shot (WS)	Shows the subject and full surroundings — emphasizes environment.
Extreme Wide Shot (EWS)	Subject is small within a large landscape — epic and cinematic.
Over-the-Shoulder (OTS)	Camera is behind one person’s shoulder, focusing on what they’re looking at.
POV (Point of View)	The camera shows what the character sees — immersive perspective.

Extreme Close-Up

Close-Up

Medium Shot

Full Shot

Wide Shot

Extreme Wide Shot

Over-the-Shoulder Shot

POV (Point of View) Shot

Camera Angles

Angle	Description
Eye-Level	Neutral, natural perspective — like the viewer’s eye line.
High Angle	Camera looks down on the subject — makes them seem small or vulnerable.
Low Angle	Camera looks up at the subject — makes them seem powerful or heroic.
Dutch Angle (Tilted)	Camera is tilted diagonally — adds tension or unease.
Bird’s Eye View / Top-Down	Shot from directly above — good for movement, choreography, or maps.

Eye-Level, medium shot

High-angle, medium shot

Low-angle, medium shot

Dutch Angle (Tilted), medium shot

Bird’s Eye View / Top-Down shot

Camera Movements (Dynamic Shots)

Movement	Description
Static	Camera doesn’t move — perfect for portraits or emotional moments.
Pan	Camera rotates horizontally left ↔ right.
Tilt	Camera moves vertically up ↕ down.
Dolly (Push/Pull)	Camera moves forward or backward smoothly on rails. Great for dramatic reveals.
Truck (Left/Right)	Camera moves side to side — similar to dolly but horizontally.
Crane / Jib	Camera moves up or down through large vertical space — majestic motion.
Orbit / 360° Move	Camera circles around the subject — cinematic hero shot.
Tracking / Follow Shot	Camera moves following the subject — conveys movement and energy.
Zoom In / Out	Lens zooms, not physical movement — adds focus or emotional punch.
Handheld	Shaky or organic motion — feels immersive or documentary-style.
Drone Shot	High-altitude or sweeping view — ideal for landscapes, travel, or sports.

Static

Pans Left to Right

Tilt Upward

Dolly Pull

Orbit / Circles

Track / Follow

Zoom Out

A Brief Overview of Artificial Intelligence (AI)

Artificial Intelligence (AI) is about teaching computers to do smart things that normally require human intelligence — such as understanding language, recognizing faces, playing games, or creating art and music.

AI learns from lots of examples so it can notice patterns. For example, if you show AI thousands of pictures of cats and dogs, it will eventually know the difference between what a cat looks like and what a dog looks like. It doesn’t know what a cat or dog does; it just knows what a cat or dog looks like. Data is the fuel of AI. The more data and the cleaner the data, the better the AI is. AI improves by trial and error. Initially, AI will guess what a cat or dog looks like. If it makes a mistake and humans correct it, AI will learn and, eventually, not make the same mistake.

Huge Neural Networks

AI uses neural networks, which are like a simplified brain. Large Language Model (LLM) and Diffusion Model are two types of huge networks. These models are trained on billions of examples from the internet.

Feature	LLM (Large Language Model)	Diffusion Model
Main goal	Generate or understand text	Generate or edit images/videos
Input type	Words / sentences	Text (as a prompt) + random noise
Output type	Text (e.g., paragraphs, code, chat)	Visuals (e.g., images, videos)
How it learns	Predicts the next word	Learns to reverse noise and create images
Examples	ChatGPT, Claude, Gemini	Midjourney, Stable Diffusion, DALL·E, Runway, Kling

AI as a Classroom Analogy

Imagine a big school containing teachers (humans) and students (AIs).

The Teacher (Humans)

The teachers (humans) give the students (AIs) tons of examples: books, images, songs, videos, code — everything. The students don’t just memorize — they practice until they can do similar things on their own.

The LLM Student (AI)

One student, Lucy the LLM, loves reading and writing. She studies every book in the library and learns:

“After the words ‘Once upon a’, the next word is usually ‘time’.”

She becomes amazing at predicting the next word and can write essays, stories, or even hold a conversation — because she knows how words fit together.

Lucy = Large Language Model (writes and speaks intelligently).

The Diffusion Student (AI)

Another student, Danny the Diffusion Model, loves art. His training exercise:

The teacher shows him a picture.
Then they cover it with random paint splatters.
Danny learns to carefully “un-splatter” the image until it looks clear again.

After years of practice, Danny can now start with a blank canvas (just random dots) and, when you say “a cat wearing sunglasses,” he paints that from scratch.

Danny = Diffusion Model (paints images from words).

The Big Picture

Both Lucy and Danny are smart in different ways:

Lucy talks and writes (text world 🌍).
Danny paints and visualizes (image world 🎨).
They often work together — Lucy writes the idea, Danny draws it.

Types of AI

Category	Description	Example
Narrow AI	Specialized, task-based AI	ChatGPT
Generative AI	Creates new content	DALL·E, Runway
Agentic AI	Acts independently to achieve goals	AutoGPT, Devin
Analytical AI	Analyzes large data	Fraud detection
Predictive AI	Forecasts outcomes	Stock or weather models
Conversational AI	Talks with humans	ChatGPT, Siri
Robotic AI	Moves and interacts physically	Drones, factory robots

Machine Learning teaches AI patterns.
Generative AI teaches AI creativity.
Agentic AI teaches AI action and autonomy.

AI Leaderboard

There are many AI models by different creators, but some are more intelligent than others. This leaderboard ranks AI models based on intelligence score.

Some models are better than others at specific tasks. The following leaderboards compare leading models by

Here’s the current top-5 leaderboard for web development as of Nov 11, 2025.

AI, CPUs, and GPUs

Type	Analogy	Good For
CPU	A brilliant chef cooking one perfect dish at a time	Complex logic, single-thread tasks
GPU	A kitchen with 1,000 chefs making the same dish simultaneously	Massive parallel work (like AI math)

Originally, GPUs were built for graphics — drawing images and 3D scenes in video games.

CPUs (central processing units) are like smart workers — great at doing one task at a time, but carefully.
GPUs are like armies of workers — they can do thousands of small calculations at once.

What would take a CPU weeks to do, a GPU can do in hours.

AI depends heavily on GPUS.

Nvidia, the company that makes most AI GPUs (like the H100 and A100), has become one of the most valuable companies in the world because of AI demand.

Data centers around the world are being built specifically to host GPU farms — giant rooms filled with thousands of GPUs that power AI models.

Cloud providers (like AWS, RunPod, or Google Cloud) rent out GPU power so smaller developers can build and test AI apps without owning hardware.

AI Training VS Inference

There are two main stages of AI:

1. Training (Learning)

This is when the AI learns from data. For example, teaching an AI to recognize cats by showing it millions of cat photos. It requires huge amounts of computation and it needs massive GPU clusters — sometimes thousands of GPUs working together for weeks or months.

Think of it like “going to school.”

2. Inference (Using What It Learned)

Once trained, the AI can now use what it knows — answering questions, generating images, etc. It still uses GPUs, but fewer — since it’s now recalling knowledge rather than learning it.

Think of it like “taking an exam” — it’s using what it learned efficiently.

AI Hallucinations

An AI hallucination is when an AI makes up something false but presents it as true.
It happens because AIs don’t know truth — they just predict what sounds right based on pattern-matching. For example,

AI Type	What “Hallucination” Looks Like
LLM (ChatGPT, Claude)	Makes up fake facts, quotes, sources, or people
Image Model (Midjourney, DALL·E)	Adds random visual details that weren’t in the prompt (like extra fingers 👋 or distorted objects)
Video / Audio Models	Create unrealistic motion, or mis-sync voices and faces

Hugging Face

Hugging Face is basically the GitHub of Artificial Intelligence. It’s a giant online platform where people share, explore, and use AI models, datasets, and tools — all in one place. At Hugging Face, you’ll find

Section	What It Offers	Example
Models	Pre-trained AI models (text, image, audio, etc.)	ChatGPT-like LLMs, Stable Diffusion, Whisper
Datasets	Large collections of data used to train AIs	Wikipedia text, image caption sets, code samples
Spaces	Interactive apps people build with AI	You can test image generators, chatbots, translators
Transformers Library	Hugging Face’s open-source code that makes it easy to use models	Used by researchers, developers, and hobbyists everywhere

Imagine you want to build an app that turns spoken words into text, translates it to French, and then summarizes it. On Hugging Face, you can:

Search for a speech-to-text model (like OpenAI’s Whisper).
Add a translation model (like Helsinki-NLP English-to-French).
Plug in a summarizer (like BART or T5).
Run it all with just a few lines of Python code using the transformers library.

You don’t have to train anything from scratch — it’s all there, ready to go.

Hugging Face Spaces is like GitHub Pages – it’s where developers can host their AI models and provide a simple UI using Gradio, a simple web interface to demo an AI app. For example, you can search for “text to image”, click on a result, like Gemini Image Generator, and test the model in a browser.

You can also use models in your own code using Python. For example,

Install the transformers library:

pip install transformers

Load a model (for example, a text generator):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time, there was a cat", max_length=30)
print(result[0]["generated_text"])

ComfyUI

ComfyUI is a visual, node-based interface for creating AI images and videos. It’s a drag-and-drop app that lets you build your own AI image or video generator. Instead of typing long code, you build a “workflow” by connecting blocks called nodes. Each node does one thing:

One node loads your model
One node reads your text prompt
One node generates an image
Another node might upscale, add depth, or edit colors

You can run ComfyUI locally on your computer, but you need a lot of space and a powerful computer, preferably with a GPU. These computers are expensive. Alternatively, you can run ComfyUI in a browser at ComfyCloud, RunComfy, and ThinkDiffusion, where you can rent powerful GPUs and pay only for when you’re using the service.

Image and Video AI Playground

You can test some of the leading image and video AI models at RunComfy’s playground.

Amazingly Consistent AI-Generated Characters From a Photo Using SeeDream 4.0

After testing various AI models to generate images of a character based on one or two reference character images, I have found SeeDream 4.0 to be the best. I’ve even found it to be much better than Google’s Nano Banana for this purpose. Here are some example SeeDream-generated images from two photos of me.

My Reference Photos

Prompts + Generated Images

Note: With consistent character images, you can bring them into Kling AI as references to generate consistent-character videos.

Photorealistic cinematic medium shot of the man skiing, face uncovered, wearing his black hat and sunglasses, snow crystals clinging to his eyebrows and beard, breath visible in the cold air, cheeks flushed from exertion, eyes focused ahead, crisp alpine sunlight illuminating his face — ultra-realistic GoPro portrait with shallow depth of field and cinematic lighting. His entire face is visible.

Wide shot of the same man skiing down a pristine mountain slope, wearing a black baseball cap and semi-transparent gradient sunglasses. The camera faces him from the front, capturing his body leaning into a turn, powder snow flying, blue sky above, cinematic lighting, GoPro ultra-wide perspective.

Cinematic chest-up shot of the man wearing a black baseball cap forward and gradient sunglasses, ocean droplets on his face and shoulders, sunlight reflecting in his lenses, deep blue waves behind him, captured from a GoPro mounted near the surfboard’s nose.

Wide front-side GoPro shot of the man surfing a curling turquoise wave, wearing his black baseball cap and gradient sunglasses, camera facing him from low on the wave surface, dynamic water spray and cinematic sunlight.

Chest-up shot of the man free-falling through a bright blue sky, wearing a black baseball cap forward and gradient sunglasses, wind sweeping his jacket and straps, sunlight reflecting off his glasses, clouds far below, cinematic color and realism.

Wide cinematic shot of the man skydiving, body stretched out and stable, camera facing him from the front at mid-distance, black cap and sunglasses clearly visible, clouds and the earth below in vast perspective, soft sun haze.

Chest-up shot of the man riding a sports motorcycle, wearing a black baseball cap facing forward and gradient sunglasses, wind flowing through his clothes, warm golden-hour lighting, GoPro mounted near the handlebars, cinematic clarity.

Wide front-side GoPro shot of the man on a motorcycle, black cap and gradient sunglasses visible, leaning slightly into a curve, mountain road ahead, trees and sunlight framing the motion, cinematic outdoor tone.

Wide underwater GoPro shot of the man swimming near a massive whale, camera facing him from the front, black swim cap and swimming goggles visible, sunlight streaming through the blue water, cinematic and peaceful tone.

Cinematic chest-up side-profile shot of a man in a helicopter cockpit, camera positioned slightly in front and to the left of him, showing the cockpit controls in front of him and his face in profile as he pilots, wearing a black baseball cap facing forward and semi-transparent gradient sunglasses, sunlight streaming through the windshield, clouds ahead, highly realistic lighting and detail.

Wide front-side shot from outside the helicopter, camera hovering just ahead of the cockpit, showing the man’s face through the glass, wearing a black baseball cap and gradient sunglasses, mountain clouds around the aircraft, cinematic aerial composition.

Cinematic chest-up side-angle shot of a man actively climbing a rugged rock wall, camera positioned slightly below and to his right, angled upward toward his face. He’s wearing a black baseball cap facing forward, semi-transparent gradient sunglasses, and mountain-climbing gear with a harness, rope, and gloves. His left hand grips a rock hold firmly while his right hand reaches upward, sunlight illuminating his face and the textured stone surface. Sharp, realistic detail with warm mountain tones.

Wide front-side GoPro shot showing the man mid-climb on a tall cliff face, wearing a black baseball cap, gradient sunglasses, and full climbing gear. Camera positioned several meters away at a diagonal angle showing his full body against a vast mountain backdrop. His left leg is bent, right leg extended, and the valley below stretches into hazy sunlight. Cinematic scale and clarity.

Chest-up shot from a GoPro mounted near the motorcycle’s front handlebars, camera angled slightly upward toward the man’s face as he rides. He’s wearing a black baseball cap facing forward, gradient sunglasses, and gloves. Dust trails swirl behind him under golden desert sunlight, with blurred dunes in the background. Cinematic and ultra-realistic motion energy.

Wide front-side shot of the man speeding across desert dunes on a dirt bike, camera hovering ahead and slightly to the right of the vehicle. He wears a black cap, gradient sunglasses, and gloves. Sand sprays from his back tire as he leans into a curve, sunlight casting long shadows across rippling dunes. Cinematic wide desert action.

Chest-up shot from a GoPro mounted on the kayak’s bow, camera angled back toward the man’s upper body and face. He’s wearing a black baseball cap forward, gradient sunglasses, a water-resistant jacket, and a life vest. His arms are mid-stroke with a double-bladed paddle, water splashing upward, sunlight sparkling on droplets, dynamic river motion, cinematic realism.

Wide front-side GoPro shot from low on the river surface, camera facing the man as he navigates a turbulent section of rapids. He’s wearing his black cap and gradient sunglasses, gripping the paddle firmly, kayak tilted slightly as water churns around him, cinematic sunlight filtering through mist.

Chest-up shot from a GoPro mounted on the paraglider harness bar, camera angled slightly upward toward the man’s face. He’s wearing a black baseball cap facing forward, gradient sunglasses, a light windbreaker, and safety harness straps. Clouds drift behind him, sunlight flaring through, mountain peaks faintly visible below, cinematic aerial serenity.

Drone shot showing the man from the chest up as he hikes through a dense green forest. The camera hovers at eye level slightly ahead of him, capturing his focused face under a black baseball cap and gradient sunglasses. Sunlight filters through tree leaves, creating a shadow across his jacket and backpack straps, cinematic natural lighting.

Wide drone shot slightly elevated in front of the man as he walks between tall pine trees. His cap and sunglasses reflect the early morning light while his backpack swings gently with his steps. Mist floats through the shafts of sunlight, cinematic forest depth.

Drone hovering close to the cliff edge on the side of the man, capturing his profile from the chest up as he stands poised ready to jump from above the turquoise ocean below. His black cap and gradient sunglasses face forward. Behind him, the vast sea glimmers beneath a bright sky, cinematic tension and anticipation.

Wide drone shot halfway between the top of a cliff and the ocean below showing the side/profile of the man mid-air after jumping from the cliff, face down, feet up, and body stretched toward the turquoise ocean water below. His face and cap visible from the side, cinematic natural lighting and energy.

Drone hovering ahead of the man, showing him from the chest up as he snowboards down a slope. He wears a black cap, gradient sunglasses, insulated jacket, and gloves. Snow trails burst behind him as he cuts through the powder under a bright winter sky, cinematic frozen landscape.

Drone shot showing the man from the chest up as he wakeboards across a shimmering blue lake. The drone hovers ahead and slightly above, facing him directly. He wears a black cap, gradient sunglasses, and life vest, holding the tow rope tightly, droplets splashing midair, cinematic high-speed energy.

Wide drone shot showing the man wakeboarding across open water, his body leaning slightly as he balances on the board. The drone moves ahead and to the side, keeping his face visible as water trails behind him in sparkling arcs, cinematic energy and clarity. The drone is not visible in the image. He wears a black cap, gradient sunglasses, and life vest.

Drone hovering ahead of the man at chest height as he runs across a wide desert plain. His black baseball cap and gradient sunglasses stand out against the golden dunes. He wears a sleeveless athletic shirt, sweat and dust on his skin, sunlight glowing in orange hues, cinematic desert light.

Drone hovering ahead of the man at chest height as he pedals up a rugged mountain trail, black baseball cap facing forward and gradient sunglasses catching the sunlight. His shirt and arms are dusted from the trail, and trees blur behind him as sunlight filters through pine branches, cinematic outdoor realism.

Wide drone shot from a front-side angle showing the man biking along a narrow ridge path surrounded by green valleys below. His black cap and sunglasses glint in the sun as the trail twists ahead, cinematic adventure composition.

Drone hovering ahead of the man, chest-up shot showing him ziplining through a lush tropical jungle. He wears a black baseball cap and gradient sunglasses, his shirt rippling from the wind, treetops blurring behind him, cinematic motion and sunlight streaming through green canopy.

Wide drone shot capturing the man ziplining high over dense jungle canopy, seen from a front angle that shows his face. His cap and sunglasses are visible, body extended midair, forest stretching below in cinematic scale. The drone is not visible in the image.

Drone hovering just above the sailboat deck, chest-up view of the man at the helm steering through calm blue water. He wears a black baseball cap and gradient sunglasses, soft wind rippling his shirt, ocean horizon stretching behind him, cinematic ocean serenity.

Cinematic chest-up shot of the man standing inside a colorful hot air balloon basket. The camera is positioned level with his chest, facing his face at a 45-degree angle from the front as the balloon floats above scattered clouds. He wears a black baseball cap and gradient sunglasses, his face softly illuminated by golden morning light. The colorful canopy rises above him, adding visual warmth and adventure.

Cinematic chest-up shot of a man standing on the deck of a fishing boat, holding a fishing rod angled toward the sea. The camera is positioned at chest height, facing him slightly from the side so that both his expression and the ocean behind him are visible. He wears a black baseball cap and gradient sunglasses, wind brushing against his shirt, sunlight reflecting on the waves, cinematic natural realism.

Cinematic chest-up shot of a man riding a horse across an open golden plain at sunset. The camera faces him slightly from the front at chest level. He wears a black baseball cap forward, gradient sunglasses, and short hair (no long hair at the back). The sunlight warms his face, cinematic natural beauty and motion.

Wide cinematic shot showing the man on horseback galloping across the open field, the horizon glowing orange behind him. His face is visible under a black cap and sunglasses, short hair beneath. Dust and grass swirl around, cinematic sweeping grandeur.

Cinematic chest-up shot of the man driving a vintage convertible along a coastal highway. The camera faces him from the passenger side, sunlight reflecting off his gradient sunglasses. He wears a black baseball cap forward and short hair, no long hair at the back. The ocean glimmers beyond the road, cinematic freedom and motion.

Cinematic chest-up shot of a man sandboarding down a tall golden dune, even though only his upper body is visible. The camera faces him from the front at chest level, capturing grains of sand flying around. He wears a black baseball cap forward, gradient sunglasses, and has short hair. His shirt ripples in the desert wind under warm sunset light, cinematic thrill and freedom.

Wide cinematic shot of the man carving across a massive dune while standing on a sandboard, his body angled into the slope as sand billows behind. The camera faces him from the front-side, 45 degrees from the front of his face, showing his all-black cap, and sunglasses illuminated by the sunset glow, cinematic desert scale. He has no hair at the back of his head.

Cinematic chest-up shot of a man trekking a rugged volcanic ridge, visible from chest up with his hiking backpack straps showing. The camera faces him slightly from below. He wears a black baseball cap, gradient sunglasses, and a determined look. Faint volcanic steam rises behind him, cinematic endurance and heat.

Wide cinematic shot from the side of the man walking along the sharp crest of a volcanic ridge. His face is visible under the black cap and sunglasses, catching the backlight, cinematic scale and strength.

Wide cinematic shot of a dune buggy going up the side of a steep sand dune, camera facing the front corner of the vehicle. The man’s face is visible through the open window, cap and sunglasses reflecting sunlight, cinematic rugged adventure.

Wide cinematic shot of the man crossing the middle of a long bridge stretched over a deep canyon, facing forward. His cap and sunglasses glint in sunlight, cinematic height and bravery.

Cinematic chest-up shot of a man riding a jet ski across turquoise tropical water, visible from the waist up with the handlebar in frame. He wears a black baseball cap, gradient sunglasses, and a determined smile as spray splashes around, cinematic speed and excitement.

Wide cinematic side/profile shot of the man cutting across tropical turquoise water on his jet ski, small islands in the background. His face remains visible as sunlight glints off his sunglasses, cinematic power and summer vibrance.

Cinematic chest-up underwater shot of a man swimming alongside dolphins, visible from the shoulders up. He wears a black swimming cap, gradient sunglasses, and a joyful expression. Dolphins swim beside him as sunlight patterns ripple across his face, cinematic harmony and wonder.

Cinematic chest-up shot of a man standing on a paddleboard on a still mountain lake, visible from chest to hat top. He’s holding a paddle. He wears a life vest, black baseball cap, gradient sunglasses, and calm focus, reflections of clouds in the water, cinematic serenity.

Wide cinematic side shot showing the man paddleboarding toward distant mountains on mirrored water. The entire paddleboard and the man standing on it are visible in the side profile shot. His face is visible as he looks to the camera, cap and sunglasses glinting in the sunlight, cinematic peaceful scale.

Cinematic chest-up shot of a man wearing a wingsuit mid-flight, camera facing him slightly from below to show air rushing around. He wears a black baseball cap, gradient sunglasses, with parts of the red wingsuit visible at his shoulders. Vast cliffs blur behind, cinematic thrill and velocity.

Cinematic chest-up shot of a man riding a camel across golden desert dunes, only his upper body and part of the camel’s saddle visible. He wears a black baseball cap, gradient sunglasses, and relaxed posture under warm sunlight. The horizon glows orange, cinematic calm and travel spirit.

Cinematic chest-up shot of a man ascending Mount Everest in climbing gear, rope visible over his shoulder, breath misting in the cold air. He wears a black beanie, gradient sunglasses, expression focused with snowflakes clinging to his jacket. Harsh sunlight reflects off snow, cinematic struggle and determination. He is not wearing a baseball cap.

Wide cinematic shot of the man scaling an icy slope near Everest’s peak, rope anchored below, prayer flags fluttering in the distance. His face is visible. He is wearing climbing gear, rope visible over his shoulder, thick gloves, a black beanie and sunglasses, cinematic grandeur and persistence. He is not wearing a baseball cap.

Cinematic chest-up shot of a man floating in space tethered to the International Space Station, wearing a detailed astronaut suit with a large glass helmet revealing his full face, he’s wearing a thin black beanie covering his hair and black sunglasses with a semi-transparent gradient lens. Earth glows blue below, cinematic awe and isolation. He is not wearing a baseball cap.

Cinematic chest-up shot of a man seated in a luxurious first-class cabin of an A380 commercial jet, eating a gourmet meal beside the window. He wears a black baseball cap, gradient sunglasses, short hair, relaxed smile, soft daylight illuminating his face and tableware. Background shows blurred clouds outside, cinematic comfort and refinement.

How to Make an AI Video With Consistent Characters

In this post, I’ll show how I made this AI video containing a consistent main character.

Note: as I made this video, I kept finding and learning better ways to make it, so this video is not as good as could be had I started over using the techniques in this post.

Decide on a theme/setting

I wanted to make a music video using the song “When You Believe” from the movie “The Prince of Egypt”. For that reason, I chose ancient Egypt for my setting.

Create your main character

I wanted to see if I could create an AI version of me with changes to suit my desired main character, i.e., I wanted the character’s face to look as close as possible to me, but have other features like physique be more athletic and sculpted. To create the main character, I chose a picture of me that I liked and I asked ChatGPT to give me a prompt to create an image of a person that matches an example image I found.

ChatGPT analyzed the outfit in the image I uploaded and gave me a prompt to generate an image with that outfit applied to a reference image I would upload. For the reference image, I chose this picture of me.

In OpenArt, I did the following

Image > Create Image
Model = Nano Banana
Uploaded my reference image to the Omni Reference input
Clicked “Create”

Nano Banana gave me this:

It was good, but I ended up making various changes until I got this:

This final character image is based on another photo of mine. Tweaking the character photo was a lot more difficult than it should have been. I had to tell Nano Banana to make my arms a little darker and have me facing straight at the camera. When you create your main character, whether from a photo of a real person or not, try to have the character standing straight and facing the camera with a neutral facial expression. In my image above, I have a little smile, which kept getting carried over into various AI image generations. To fix this, I had to update every prompt that used this image to include the comment, “Subject is not smiling”. Also, AI generations would often remove my sunglasses, even though my main character reference image includes sunglasses. To fix this, I had to update every prompt that used this image to include the comment, “Subject is wearing glasses or his sunglasses.”

Note: Make sure you are satisfied with your main character image because you will use it to generate many images and videos with it as a reference.

Decide on your video content structure

In my case, I didn’t want to show viewers the main character non-stop throughout the video, so I decided to have alternating clips showing clips with and without the main character, i.e.

Scene 1	Video clip without the main character
Scene 2	Video clip with the main character
Scene 3	Video clip without the main character
Scene 4	Video clip with the main character
Scene 5	Video clip without the main character
Scene 6	Video clip with the main character
…	…

Generate image and video prompts

Generating images is much cheaper than generating videos. For example, in OpenArt, here’s what I paid in credits for each type:

Image using Flux Pro model	5 credits
Image using Nano Banana model	16 credits
Video using Kling 2.5 model	100 credits

For this reason, I created an image of the first frame of each scene (video clip) before creating each video clip. If I didn’t like an image that was generated, I could regenerate it, which is much cheaper than regenerating a video.

For the scenes without the main character, I didn’t need a reference image, so I used the Flux Pro model, which was cheap (5 credits) and produced great results.

For the scenes containing the main character, I needed to upload my reference image. Flux Pro doesn’t support reference images, so I used the Nano Banana model, which does support reference images.

UPDATE: actually, SeeDream 4.0 is just as good as Nano Banana. Use both SeeDream and Nano Banana, then choose which result you prefer.

For video generation, I chose Kling 2.5 because it supports a first-frame image option and produced good results.

Since we want the final video not to show a bunch of random content, I had each pair of consecutive scenes with and without the main character be relevant to each other. To do this, and to generate detailed prompts that matched my theme and setting, I used ChatGPT. For example, I asked ChatGPT the following:

Give me 10 pairs of cinematic video prompts along with their first-frame image prompt. For each pair, the first set of image and video prompts will not include my main character reference image, and the second set of image prompts will include my main character reference image.

I will use Flux Pro to generate images that don’t include the main character, I will use Nano Banana to generate images with the main character reference image, and I will use Kling 2.5 to generate each video using the corresponding image as the first frame. Every detail that will be referenced in the video prompt should be included in the image prompt.

Since I am making a music video of the song “When You Believe – The Prince of Egypt”, the theme/setting of the scenes should be ancient Egypt. The scenes should include popular landmarks, like the pyramids and the Nile River, people in a busy market, ancient mosques, and, of course, a royal palace for the prince. Each pair of prompts with and without the main character should be relevant to each other so as to flow as a story. If possible, all scenes should flow together as a cohesive story.

After answering a few follow-up questions from ChatGPT, it gave me the prompts, like the following:

1. The Desert Procession (without the prince)

Image Prompt (Nano Banana)
Epic realism, ancient Egyptian soldiers and servants walking through golden desert dunes at dusk, carrying ornate banners and treasures toward a distant palace on the horizon. The sky glows crimson-orange, sand whipping lightly in wind, dust haze around the caravan, footprints trailing through rippled dunes. Cinematic depth, ultra-detailed, 50mm lens realism, warm atmospheric glow.

Video Prompt (Kling)
Dynamic tracking shot alongside the caravan; camera glides parallel to marching servants, banners flapping, sand swirling, sunlight catching on gold ornaments; as the procession nears the palace, a subtle lens flare crosses frame, suggesting the arrival of royalty; ambient desert wind and drum rhythm underscore the movement.

2. The Prince’s Arrival (with the prince)

Image Prompt (Nano Banana)
Epic realism, regal Egyptian prince wearing a black tunic with gold trim, ornate jewelry, and a linen cloak, descending from his chariot before the grand palace gates. Behind him, the same caravan from previous scene arrives in the background. The palace gleams under the last light of sunset, palm trees swaying gently, atmosphere majestic and warm. The prince’s face is clearly visible, confident and composed, cinematic lighting on his features, 50mm shallow depth.

Video Prompt (Kling)
Camera begins low behind the prince’s boots stepping onto polished stone, then cranes upward to reveal his full figure framed against the glowing palace. Cloak moves naturally in breeze, chariot wheels still turning in background, banners from previous scene flutter in golden light. Subtle push-in as the prince lifts his gaze toward the palace — the soundtrack swells, symbolizing destiny and return.

Generate Images and Videos

Now that we have our inputs (prompts and main character image), we can generate images and videos for each scene.

For the images without the main character, I used Flux Pro like this

In this example, I got the following image:

I then used that image as the first frame to generate a video using Kling 2.5 as follows:

This produced the following video:

For the images with the main character, I used Nano Banana like this:

Notice that I uploaded my main character reference image in the Omni Reference field. I also uploaded the image from the previous prompt showing the caravan since this prompt references it. This prompt generated the following image:

Again, I then used that image as the first frame to generate a video using Kling 2.5 as follows:

which generated the following video:

Repeat for all prompts

I then just repeated the previous image and video generation step for all the prompts that ChatGPT gave me.

Create final video

Lastly, I added all assets (video clips, music) to a video editing to (I used Capcut), put them on the timeline in my desired sequence, and exported the final video.

Create a poster image

Since I published my video on YouTube, I wanted an attractive thumbnail / poster image. I scrubbed through the video in the timeline in Capcut until I found a frame that I like, and then exported that frame as a still image. I then added the title of the song (“When You Believe”) to it in Photoshop. This is the poster image I ended up with.

How to Make an AI Music Video

In this post, I’ll explain how to make a music video using AI like the one below.

Get a song

The first step is to get a song. Here are some options:

Buy a song, e.g., on Amazon
Download a song from YouTube, e.g., using a YouTube music downloader
Create a song using AI, e.g., using Suno or TopMediAI

I chose the song “California Love” and I had it in mp3 format.

Optionally, separate vocals from backing track

Sometimes, when you upload a song to generate a lipsync video, the tool can’t distinguish between the vocals and the backing track, resulting in imperfect lip movements. For example, using this song segment

we see that the lipsync results near the end are inaccurate.

If you created the song using Suno, TopMediAI, or any other AI music generation tool, you may be able to cleanly separate the vocal track. Look for “get stems”. Otherwise, you can try to separate the vocals from the backing track using UVR Online.

Here’s how the same lipsync video looks using this vocal track separated using UVR Online.

In this case, the lip movements at the end look accurate, but the beginning looks imperfect, which is why it’s better to get the vocal track from the source.

Get a photo of your singer

In the video above, I wanted to be the singer, so I found a clear, front-facing photo of myself. I then removed the background. Make sure to have a clear photo for better results.

Generate AI prompts for different scenes of the music video

When it comes to generating AI prompts, I find it easier to use AI (I use ChatGPT) to generate the prompts. Later, we will use AI to generate a lip sync of the singer singing the song. Unfortunately, at the time of this writing, AI is not good enough to lip sync images of subjects where their faces are too small, so for the AI prompts, we want to tell ChatGPT to show a close-up of the subject such that their face occupies 1/3 or 1/2 the image height.

Here’s an example prompt to generate a list of prompts.

Give me 20 prompts to generate 20 images using AI with the following criteria:

the generated images should be relevant to the theme or lyrics of the song “California Love” by 2pac, ft. Dr. Dre.
the singer in the generated images will come from an uploaded reference image
the singer should be shown close up from the waist up so that an AI lip sync tool will produce better results
the singer should be either facing the camera or at most facing 45 degrees from the camera

ChatGPT may ask for more information or inputs, like the image of the singer or the lyrics to the song.

Here is example output from ChatGPT.

It’s important to ensure the face of the subject is clear so the lip-sync video generation results are good. Here’s a comparison of video results from images with a small and large subject face.

Subject is not close up

Subject is close up

Subject is not close up

Subject is close up

Notice how the videos where the subject is not close up show facial distortion. The videos with the subject close up show much less distortion. The closer the subject (extreme close up), the less the distortion, e.g.,

Extreme close up

Generate still images using the prompts and the singer image

I like to use OpenArt and Google Nano Banana to generate the images. In OpenArt, go to

Image > Create Image
Model: Choose the “Nano Banana” model (I prefer this model for this purpose)
Prompt: enter one of the prompts from the previous step
Omni Reference: upload the image of the singer from the previous step (you can optionally upload more than one image, e.g., a front-facing image, a side-facing/profile image, etc)
Output Size: I prefer 16:9
Upscale output: x2 (if you choose x4, the image will be too large for TopMediAI – the lipsync tool – so I choose x2)
Number of images: 1 (you can choose more than one, but you’ll pay more. I prefer to start with one, and if the results are okay, I’ll regenerate more images to get a variety)

Here’s an example of the interface.

After you do this for each prompt, you should end up with an array of images.

Note that you will likely need to tweak some of the prompt text and regenerate images until you get results you like.

Organize files

Since you’ll be working with many files, you’ll want to organize them so you don’t get lost and make mistakes. I like to put all files in a folder like this

“California Love” (folder)

california-love.mp3
img-01.jpg
img-02.jpg
img-03.jpg
…

Split the song in a video editor

I like to use Capcut for video editing. Import the audio file and all image files to Capcut.

Add the audio file to the audio track
Play the video (audio only at this point) and place a marker where you want different scenes to appear
Listen to the audio and move the marker so that it is located between words or sentences.
Split the audio track at the marker
Repeat for as many scenes (clips) as you want

Add an image to each audio clip

On a video track, add an image for each audio clip such.
Set the duration of the image clip to match the duration of the audio clip by dragging either end of the image clip to the left or right.
Play the video in the preview window to see if you like the image sequence for the associated music.

Export the audio clips

If you like the preview in the previous step, export each audio segment in mp3 format. I use the left and right arrow keys to jump the playbar to the start or end of each segment and then I hit the “i” key to set the start point and the “o” key to set the end point so that I only export the segment that I want. The other segments will be grayed out.

Click the “Export” button and check only the “Audio” checkbox. Name each export with a number corresponding to its order in the timeline.

When done, your folder structure should look like this:

“California Love” (folder)

audio-001.mp3
audio-002.mp3
audio-003.mp3
…
california-love.mp3
img-01.jpg
img-02.jpg
img-03.jpg
…

Create a lip-sync video from each image clip

There are many AI lip-sync tools, like HeyGen. HeyGen is used 3D modeling, which is good, but the output is more for talking rather than singing. As such, only the subject’s head moves. For the video above, I used TopMediAI. Though it only uses 2D modeling, it does an impressive job at animating an entire image, including the background, the subject’s head, and the subject’s body.

Log in to TopMediAI
Music AI > Music Tools > Singing Photo Maker
Drag/upload audio segment 1 (audio-01.mp3)
Drag/upload image 1 (img-01.jpg)
Click the “Create” button

When generation is done, preview the lip sync video. If it’s good, save the video as “video-01.mp4” and repeat for all other images. When done, your file structure should look like this:

“California Love” (folder)

audio-001.mp3
audio-002.mp3
audio-003.mp3
…
california-love.mp3
img-01.jpg
img-02.jpg
img-03.jpg
…
video-01.mp4
video-02.mp4
video-03.mp4
…

Add the lip-sync videos to the video timeline

Import all lip-sync videos to Capcut
Drag each lip-sync video to a new video track above the track containing images, making sure the video matches the image.
Optionally add transitions, e.g., fade in/out audio or still images, crossfade between video clips, etc
Disable the audio in the lip-sync video track so that only the audio from the music track is heard.
Preview the entire video

If everything looks good, export the video. Click the “Export “button, check the “Video” checkbox, and set parameters (4K resolution, etc).

B-roll

Many videos include B-roll, which is filler content. In my music video above, my B-roll includes video footage of various iconic locations in California. To create this footage, I did the following:

Asked ChatGPT to give me a list of the top 20 most iconic landmarks in California
Asked ChatGPT to give me an image and video prompt for each of those 20 landmarks
Used OpenArt > Image > Google Nano Banana to generate an image using the image prompts from the previous step. Generating images is much cheaper than generating videos, so I prefer to generate images first. If I like an image, I use it as a reference to create a video.
Used OpenArt > Video > From Image to create a video from the images and video prompts generated in the previous steps.
Added the videos to the timeline in Capcut.

If you want to create B-roll footage using the same character, the following process produced good results.

Go to OpenArt > Video > Elements to Video
Select the Vidu Q2 model (you can use other models, like Google Veo 3, but they can be way more expensive)
Upload your character and optionally some other reference images
Enter a text description of the video scene
Choose other settings (aspect ratio, duration, resolution)
Click “Create”

For example, using this character image

and this description text:

Wide slow-tracking shot following the woman as she walks along glowing dunes. Fire lines cross behind; embers trail — cinematic motion, 8K realism.

I was able to create this video

and many other ones using the same character.

Use AI to Remove Objects / People From a Video

There are many tools for removing objects and people from a video, including Adobe After Effects. So far, the easiest one I’ve found is Fotor. You can also use it to remove watermarks on videos.

Upload your video
Click the “Remove Object” button
Paint over the objects you want to remove (you can adjust the size of the paintbrush, if needed)
Click ‘Generate Preview” to see how it looks. If it looks good, you can download the generated video if you are a paying member.

Here’s an example. Below is a video clip from a music video.

I wanted to remove the 2 people, which I identified in Fotor using the purple paintbrush.

Here’s the resulting video.

Use Runway Act-Two (AI) To Animate Characters Based on an Existing Video

Usually, people create AI videos by describing what they want using a text prompt. This can be very difficult depending on the results you are looking for. Another way to create an AI video is by creating a “driving performance” video, which shows what movements you want to mimic. For example, if you want to make a video of yourself dancing and lip syncing exactly like someone else in an existing video, you can upload the existing video as a “driving performance” video and upload an image of yourself as the “character image”. This post explains how to do it using Runway Act-Two.

In Runway, click on Act-Two. You will see the UI below.

In the top section, you upload a “driving performance” video, which will contain the body movements and facial gestures you want to copy and apply to your character in the bottom section.

In the bottom section, your character can come from an image or video that you upload. For simplicity, and to match the driving video, I will upload an image containing my character.

For demonstration purposes, I want to make myself sing and dance exactly like the subject in the following video.

Runway Act-Two provides the following recommendations for driving performance videos.

Feature a single subject in the video
Ensure the subject’s face remains visible throughout the video
Frame the subject, at furthest, from the waist up
Well-lit, with defined facial features and expressions
- Certain expressions, such as sticking out a tongue, are not supported
No cuts that interrupt the shot
Ensure the performance follows our Trust & Safety standards
[Gestures] Ensure that the subject’s hands are in-frame at the start of the video
[Gestures] Start in a similar pose to your character input for the best results
[Gestures] Opt for natural movement rather than excessive or abrupt movement

Since my “driving performance” video has people playing music in the background, I need to remove them. One way is by using Capcut’s Auto Background Removal feature.

While it’s not perfect, it may be sufficient for Runway’s Act-Two. Here are two other AI-based video background removal tools that seem to do a better job.

Unscreen

VideoBgRemover.com

VideoBgRemover.com seems to produce the best results. However, for this test, I used the imperfect results from Capcut.

If you need more accurate subject isolation or background removal, use Adobe AfterEffects’ RotoBrush feature.

DRIVING PERFORMANCE VIDEO

CHARACTER IMAGE

Runway Act-Two provides the following recommendations for character images.

Feature a single subject
Frame the subject, at furthest, from the waist up
Subject has defined facial features (such as mouth and eyes)
Ensure the image follows our Trust & Safety standards
[Gestures] Ensure that the subject’s hands are in-frame at the start of the video