MuseTalk for Real-Time Lip-Sync Avatars: Complete GPU Installation & Fixes

MuseTalk is a real-time talking-face system that combines pose estimation, voice analysis, and lip-sync. Getting it running smoothly on a cloud GPU like RunPod (NVIDIA A5000) can take a few tweaks. Below is the exact process that worked for me, plus fixes for the common snags I hit (PyTorch/CUDA mismatch, missing huggingface-cli, FFmpeg path issues, and incomplete weight downloads).


🖥️ 1) RunPod Environment

  • GPU: NVIDIA A5000
  • Template: sub38-pod
  • Base Image: runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel

Clone the repo:

git clone https://github.com/TMElyralab/MuseTalk.git
cd MuseTalk

📦 2) (If present) Activate the prebuilt virtualenv

On this image, there’s a precreated environment at /muse_env. Activate it before installing anything so tools land in the right place:

source /muse_env/bin/activate
which python
# e.g. /muse_env/bin/python

If /muse_env doesn’t exist in your template, create your own venv (python3 -m venv muse_env && source muse_env/bin/activate) and use that consistently.
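To double-check which environment pip will install into, a quick one-liner helps (the /muse_env path here assumes this template’s prebuilt env):

python -c "import sys; print(sys.prefix)"
# Expect the venv root (e.g. /muse_env), not the system Python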


⚙️ 3) Install Python dependencies

pip install -r requirements.txt

This generally succeeds, but you still need to confirm afterwards that PyTorch and CUDA actually match (next step).


🧠 4) Fix PyTorch ↔ CUDA compatibility (cu118)

The base image ships with PyTorch 2.0.1 (CU118). Verify:

python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Expect something like: 2.0.1 11.8

If it’s not CUDA 11.8 (e.g., you accidentally pulled CU12 wheels), pin the correct builds:

pip uninstall -y torch torchvision torchaudio
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 \
  -f https://download.pytorch.org/whl/torch_stable.html

✅ Result: PyTorch and CUDA align with the image (11.8).
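For a final sanity check that the GPU is actually visible to PyTorch, a short snippet like this helps (a minimal sketch using standard torch calls; the device name shown assumes the A5000 pod from this setup):

import torch

# Confirm the CUDA build matches the image and the GPU is visible
print(torch.__version__)                   # expect 2.0.1+cu118
print(torch.version.cuda)                  # expect 11.8
print(torch.cuda.is_available())           # expect True on the pod
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA RTX A5000"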


🧩 5) Install OpenMMLab components (MMPose)

MuseTalk relies on MMPose. Use mim:

pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv==2.0.1"
mim install "mmdet==3.1.0"
mim install "mmpose==1.1.0"

✅ Result: MMCV & MMPose installed and compiled against your current stack.
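A quick import test confirms the OpenMMLab stack works against the pinned PyTorch build (a minimal sketch; the expected versions match the pins above):

import mmcv
import mmengine
import mmdet
import mmpose

# All four should import cleanly; a compiled-op error here usually means
# mmcv was built against a different PyTorch/CUDA than the one installed
print("mmcv", mmcv.__version__)          # expect 2.0.1
print("mmengine", mmengine.__version__)
print("mmdet", mmdet.__version__)        # expect 3.1.0
print("mmpose", mmpose.__version__)      # expect 1.1.0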


🎬 6) Add FFmpeg (static build)

The base image usually lacks FFmpeg. Fetch a static build and expose it on PATH:

wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar -xvf ffmpeg-release-amd64-static.tar.xz
export FFMPEG_PATH="$PWD/ffmpeg-7.0.2-amd64-static"
export PATH="$FFMPEG_PATH:$PATH"

ffmpeg -version
# should print version info

(If your extracted folder name differs, update FFMPEG_PATH accordingly.)
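You can also confirm from Python that the static build is the one being picked up (a minimal sketch using only the standard library):

import shutil
import subprocess

# Resolve the ffmpeg binary on PATH and print its version banner
ffmpeg_bin = shutil.which("ffmpeg")
print("ffmpeg found at:", ffmpeg_bin)    # should point into the static build folder
if ffmpeg_bin:
    out = subprocess.run([ffmpeg_bin, "-version"], capture_output=True, text=True)
    print(out.stdout.splitlines()[0])    # e.g. "ffmpeg version 7.0.2-static ..."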


💾 7) Download the model weights

MuseTalk ships a helper script:

bash ./download_weights.sh

If you see: huggingface-cli: command not found

Root cause: Newer huggingface_hub (>=1.0) removed the legacy CLI entry point.

Fix: install a CLI-providing version and keep it pinned

pip install "huggingface_hub[cli]==0.25.2" --force-reinstall
which huggingface-cli
# e.g. /muse_env/bin/huggingface-cli
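You can also confirm the pinned library version from Python:

python -c "import huggingface_hub; print(huggingface_hub.__version__)"
# Expect 0.25.2, i.e. a release that still ships huggingface-cli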

Prevent auto-upgrades inside the script: open download_weights.sh and comment out the self-update line:

# pip install -U "huggingface_hub[cli]"   # keep pinned CLI working

Re-run:

bash ./download_weights.sh

🔍 8) Verify weights

du -sh models
# e.g. ~8.7G  models

Typical artifacts include:

  • musetalk/*
  • musetalkV15/unet.pth
  • sd-vae/diffusion_pytorch_model.bin
  • whisper/pytorch_model.bin
  • dwpose/dw-ll_ucoco_384.pth
  • syncnet/latentsync_syncnet.pt

✅ Result: All weights present (≈8.7 GB total).
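If you want a stricter check than du, a short script can spot-check the individual files from the list above (a minimal sketch; run it from the repo root and adjust the paths if your models/ layout differs):

import os

# Spot-check the main checkpoints listed above (paths relative to the repo root)
expected = [
    "models/musetalkV15/unet.pth",
    "models/sd-vae/diffusion_pytorch_model.bin",
    "models/whisper/pytorch_model.bin",
    "models/dwpose/dw-ll_ucoco_384.pth",
    "models/syncnet/latentsync_syncnet.pt",
]
for path in expected:
    if os.path.exists(path):
        print(f"OK   {path} ({os.path.getsize(path) / 1e9:.2f} GB)")
    else:
        print(f"MISS {path} -> re-run download_weights.sh")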


▶️ 9) Run MuseTalk

python app.py

You can also run a one-shot inference:

bash inference.sh
# or:
python app.py --input_video sample.mp4 --input_audio sample.wav --output output.mp4

✅ Result: Generates a realistic talking-face video from your inputs.


🧩 10) Running the Real-Time Pipeline (MuseTalk v1.5)

With MuseTalk installed and verified, we moved from the one-shot demo (python app.py) to the real-time inference pipeline, which is designed for faster lip-sync generation and suits live or interactive AI avatar applications.


🧠 Step 1 – Configure your avatar

Open configs/inference/realtime.yaml and set up your avatar block:

avator_1:
  preparation: True          # run once to build cached features
  bbox_shift: 5
  video_path: "data/video/1-prisha.mp4"   # or "data/video/1-prisha.png"
  audio_clips:
    audio_0: "data/audio/sumi.wav"

🔸 Note:
The video_path can point to either
• a short video file (.mp4, .mov, etc.), which captures natural head pose and lighting and often gives more stable results, or
• a single image (.png, .jpg), which is faster to prepare but produces a static-face style output.
Both work. MuseTalk automatically extracts frames and face regions based on what you provide.
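Before a long preparation run, it can be worth loading the YAML and printing what MuseTalk will see (a minimal sketch using PyYAML, assuming the file contains only avatar blocks at the top level):

import yaml

# Load the realtime config and list avatars, videos, and audio clips
with open("configs/inference/realtime.yaml") as f:
    cfg = yaml.safe_load(f)

for name, block in cfg.items():
    print(f"{name}: video={block.get('video_path')} preparation={block.get('preparation')}")
    for clip_id, wav in (block.get("audio_clips") or {}).items():
        print(f"  {clip_id} -> {wav}")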


⚙️ Step 2 – Prepare the avatar

Run the initial preparation and caching pass:

sh inference.sh v1.5 realtime

This performs one-time feature extraction and creates the cache folder:

/results/v15/avatars/avator_1/
├── avator_info.json
├── latents.pt
├── coords.pkl
└── masks/ ...
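Before switching preparation off, it can help to confirm the cache was actually written (a minimal sketch; the path follows the tree above, relative to the repo root, so adjust it if yours differs):

import os

avatar_dir = "results/v15/avatars/avator_1"   # adjust if your output path differs
for name in ["avator_info.json", "latents.pt", "coords.pkl", "masks"]:
    path = os.path.join(avatar_dir, name)
    print("OK  " if os.path.exists(path) else "MISS", path)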

⚡ Step 3 – Switch to fast mode

Once the avatar has been prepared, set:

preparation: False

Then run fast inference:

python -m scripts.realtime_inference --version v15 \
  --inference_config configs/inference/realtime.yaml \
  --skip_save_images

(The valid value for --version is v15, not v1.5; passing v1.5 will throw an “invalid choice” error.)

This command reuses the cached avatar, skips per-frame PNG writes, and benchmarks true inference speed on the GPU.
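To reproduce the timing numbers below yourself, a small wrapper can time the full command end to end (a minimal sketch using only the standard library; the command is the one shown above):

import subprocess
import time

cmd = [
    "python", "-m", "scripts.realtime_inference",
    "--version", "v15",
    "--inference_config", "configs/inference/realtime.yaml",
    "--skip_save_images",
]

start = time.perf_counter()
subprocess.run(cmd, check=True)       # raises if the run fails
print(f"End-to-end wall time: {time.perf_counter() - start:.1f} s")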


📊 Example timing results (RunPod A5000, 59-second audio)

| Mode | Command | preparation | Media Type | Total Time |
| --- | --- | --- | --- | --- |
| First run | sh inference.sh v1.5 realtime | True | image/video | ≈ 90 s |
| Cached run | sh inference.sh v1.5 realtime | False | image/video | ≈ 160 s (includes ffmpeg & disk I/O) |
| Fast mode | python -m scripts.realtime_inference --version v15 --skip_save_images | False | image/video | ≈ 49 s (faster than real time) |

🎬 Outcome

  • preparation: True → builds and caches avatar geometry from image or video
  • preparation: False → reuses cached data for fast inference
  • --skip_save_images → disables PNG/ffmpeg steps for pure model speed

On RunPod A5000, MuseTalk v1.5 reaches near-real-time performance using either static image or short video inputs for video_path.


🧩 11) Monitor GPU Usage (Optional)

watch -n 1 nvidia-smi

✅ Refreshes the GPU status every second, which is perfect for verifying that MuseTalk is using the GPU and for monitoring VRAM.
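If you’d rather log utilization from Python than keep a second terminal open, a simple polling loop over nvidia-smi works too (a minimal sketch; it assumes nvidia-smi is on PATH, as it is on this image):

import subprocess
import time

# Print GPU utilization and memory once per second (Ctrl+C to stop)
query = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader",
]
while True:
    out = subprocess.run(query, capture_output=True, text=True)
    print(out.stdout.strip())   # e.g. "87 %, 9832 MiB, 24564 MiB"
    time.sleep(1)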


🧭 Environment Summary (Working)

| Component | Version / Setting |
| --- | --- |
| GPU | NVIDIA A5000 |
| RunPod Template | sub38-pod |
| Base Image | runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel |
| Python Env | /muse_env (activated via source /muse_env/bin/activate) |
| which python | /muse_env/bin/python (after activation) |
| PyTorch | 2.0.1 (CUDA 11.8) |
| huggingface_hub | 0.25.2 (CLI available) |
| FFmpeg | Static build 7.0.2 on PATH |
| Total Model Size | ~8.7 GB |
| MuseTalk | Working end-to-end ✅ |

✅ Key Takeaways

  • Match PyTorch/CUDA with the base image (CU118 for this setup).
  • Pin huggingface_hub==0.25.2 to retain huggingface-cli.
  • Disable self-upgrades inside helper scripts that could break the CLI.
  • Install FFmpeg manually and export to PATH.
  • Activate the correct Python env (/muse_env) so installs go where you expect.
  • Verify weights (~8.7 GB) before first run.

🧯 Quick Troubleshooting

  • ModuleNotFoundError (mmcv/mmpose)
    Ensure you ran openmim installs after activating your env and with the final PyTorch/CUDA pinned.
  • ffmpeg: command not found
    Re-export PATH (new terminals don’t inherit it). Consider adding it to ~/.bashrc:
    echo "export PATH=\"$FFMPEG_PATH:\$PATH\"" >> ~/.bashrc
  • Slow/partial weight downloads
    Re-run download_weights.sh. For large files, huggingface-cli download supports resuming.
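    If the script keeps stalling, the same pinned huggingface_hub can also fetch a repo programmatically and resume partially downloaded files from its cache (a minimal sketch; the repo id and local_dir below are illustrative, so match them to whatever download_weights.sh actually pulls):

from huggingface_hub import snapshot_download

# Illustrative target; check download_weights.sh for the real repo ids and paths
snapshot_download(
    repo_id="TMElyralab/MuseTalk",
    local_dir="models/MuseTalk_hf",   # illustrative destination
)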

With the above, MuseTalk runs reliably on RunPod A5000 using the PyTorch 2.0.1 + CUDA 11.8 image: no CUDA mismatches, a working Hugging Face CLI, FFmpeg available, and all weights in place.
