MuseTalk for Real-Time Lip-Sync Avatars: Complete GPU Installation & Fixes

MuseTalk is a real-time talking-face system that combines pose estimation, voice analysis, and lip-sync. Getting it running smoothly on a cloud GPU like RunPod (NVIDIA A5000) can take a few tweaks. Below is the exact process that worked for me, plus fixes for the common snags I hit (PyTorch/CUDA mismatch, missing huggingface-cli, FFmpeg path issues, and incomplete weight downloads).


🖥️ 1) RunPod Environment

  • GPU: NVIDIA A5000
  • Template: sub38-pod
  • Base Image: runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel

Clone the repo:

git clone https://github.com/TMElyralab/MuseTalk.git
cd MuseTalk

📦 2) (If present) Activate the prebuilt virtualenv

On this image, there’s a precreated environment at /muse_env. Activate it before installing anything so tools land in the right place:

source /muse_env/bin/activate
which python
# e.g. /muse_env/bin/python

If /muse_env doesn’t exist in your template, create your own venv (python3 -m venv muse_env && source muse_env/bin/activate) and use that consistently.
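To double-check which environment pip will install into, a quick one-liner helps (the /muse_env path here assumes this template’s prebuilt env):

python -c "import sys; print(sys.prefix)"
# Expect the venv root (e.g. /muse_env), not the system Python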


⚙️ 3) Install Python dependencies

pip install -r requirements.txt

This generally succeeds, but you still need to confirm afterwards that PyTorch and CUDA actually match (next step).


🧠 4) Fix PyTorch ↔ CUDA compatibility (cu118)

The base image ships with PyTorch 2.0.1 (CU118). Verify:

python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Expect something like: 2.0.1 11.8

If it’s not CUDA 11.8 (e.g., you accidentally pulled CU12 wheels), pin the correct builds:

pip uninstall -y torch torchvision torchaudio
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 \
  -f https://download.pytorch.org/whl/torch_stable.html

✅ Result: PyTorch and CUDA align with the image (11.8).
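For a final sanity check that the GPU is actually visible to PyTorch, a short snippet like this helps (a minimal sketch using standard torch calls; the device name shown assumes the A5000 pod from this setup):

import torch

# Confirm the CUDA build matches the image and the GPU is visible
print(torch.__version__)                   # expect 2.0.1+cu118
print(torch.version.cuda)                  # expect 11.8
print(torch.cuda.is_available())           # expect True on the pod
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA RTX A5000"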


🧩 5) Install OpenMMLab components (MMPose)

MuseTalk relies on MMPose. Use mim:

pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv==2.0.1"
mim install "mmdet==3.1.0"
mim install "mmpose==1.1.0"

✅ Result: MMCV & MMPose installed and compiled against your current stack.
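A quick import test confirms the OpenMMLab stack works against the pinned PyTorch build (a minimal sketch; the expected versions match the pins above):

import mmcv
import mmengine
import mmdet
import mmpose

# All four should import cleanly; a compiled-op error here usually means
# mmcv was built against a different PyTorch/CUDA than the one installed
print("mmcv", mmcv.__version__)          # expect 2.0.1
print("mmengine", mmengine.__version__)
print("mmdet", mmdet.__version__)        # expect 3.1.0
print("mmpose", mmpose.__version__)      # expect 1.1.0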


🎬 6) Add FFmpeg (static build)

The base image usually lacks FFmpeg. Fetch a static build and expose it on PATH:

wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar -xvf ffmpeg-release-amd64-static.tar.xz
export FFMPEG_PATH="$PWD/ffmpeg-7.0.2-amd64-static"
export PATH="$FFMPEG_PATH:$PATH"

ffmpeg -version
# should print version info

(If your extracted folder name differs, update FFMPEG_PATH accordingly.)
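You can also confirm from Python that the static build is the one being picked up (a minimal sketch using only the standard library):

import shutil
import subprocess

# Resolve the ffmpeg binary on PATH and print its version banner
ffmpeg_bin = shutil.which("ffmpeg")
print("ffmpeg found at:", ffmpeg_bin)    # should point into the static build folder
if ffmpeg_bin:
    out = subprocess.run([ffmpeg_bin, "-version"], capture_output=True, text=True)
    print(out.stdout.splitlines()[0])    # e.g. "ffmpeg version 7.0.2-static ..."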


💾 7) Download the model weights

MuseTalk ships a helper script:

bash ./download_weights.sh

If you see: huggingface-cli: command not found

Root cause: Newer huggingface_hub (>=1.0) removed the legacy CLI entry point.

Fix: install a CLI-providing version and keep it pinned

pip install "huggingface_hub[cli]==0.25.2" --force-reinstall
which huggingface-cli
# e.g. /muse_env/bin/huggingface-cli
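You can also confirm the pinned library version from Python:

python -c "import huggingface_hub; print(huggingface_hub.__version__)"
# Expect 0.25.2, i.e. a release that still ships huggingface-cli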

Prevent auto-upgrades inside the script: open download_weights.sh and comment out the self-update line:

# pip install -U "huggingface_hub[cli]"   # keep pinned CLI working

Re-run:

bash ./download_weights.sh

🔍 8) Verify weights

du -sh models
# e.g. ~8.7G  models

Typical artifacts include:

  • musetalk/*
  • musetalkV15/unet.pth
  • sd-vae/diffusion_pytorch_model.bin
  • whisper/pytorch_model.bin
  • dwpose/dw-ll_ucoco_384.pth
  • syncnet/latentsync_syncnet.pt

✅ Result: All weights present (≈8.7 GB total).
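If you want a stricter check than du, a short script can spot-check the individual files from the list above (a minimal sketch; run it from the repo root and adjust the paths if your models/ layout differs):

import os

# Spot-check the main checkpoints listed above (paths relative to the repo root)
expected = [
    "models/musetalkV15/unet.pth",
    "models/sd-vae/diffusion_pytorch_model.bin",
    "models/whisper/pytorch_model.bin",
    "models/dwpose/dw-ll_ucoco_384.pth",
    "models/syncnet/latentsync_syncnet.pt",
]
for path in expected:
    if os.path.exists(path):
        print(f"OK   {path} ({os.path.getsize(path) / 1e9:.2f} GB)")
    else:
        print(f"MISS {path} -> re-run download_weights.sh")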


▶️ 9) Run MuseTalk

python app.py

You can also run a one-shot inference:

bash inference.sh
# or:
python app.py --input_video sample.mp4 --input_audio sample.wav --output output.mp4

✅ Result: Generates a realistic talking-face video from your inputs.


🧩 10) Running the Real-Time Pipeline (MuseTalk v1.5)

With MuseTalk installed and verified, we moved from the one-shot demo (python app.py) to the real-time inference pipeline, which is designed for faster lip-sync generation and suits live or interactive AI avatar applications.


🧠 Step 1 – Configure your avatar

Open configs/inference/realtime.yaml and set up your avatar block:

avator_1:
  preparation: True          # run once to build cached features
  bbox_shift: 5
  video_path: "data/video/1-prisha.mp4"   # or "data/video/1-prisha.png"
  audio_clips:
    audio_0: "data/audio/sumi.wav"

🔸 Note:
The video_path can point to either
• a short video file (.mp4, .mov, etc.), which captures natural head pose and lighting and often gives more stable results, or
• a single image (.png, .jpg), which is faster to prepare but produces a static-face style output.
Both work. MuseTalk automatically extracts frames and face regions based on what you provide.
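Before a long preparation run, it can be worth loading the YAML and printing what MuseTalk will see (a minimal sketch using PyYAML, assuming the file contains only avatar blocks at the top level):

import yaml

# Load the realtime config and list avatars, videos, and audio clips
with open("configs/inference/realtime.yaml") as f:
    cfg = yaml.safe_load(f)

for name, block in cfg.items():
    print(f"{name}: video={block.get('video_path')} preparation={block.get('preparation')}")
    for clip_id, wav in (block.get("audio_clips") or {}).items():
        print(f"  {clip_id} -> {wav}")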


⚙️ Step 2 – Prepare the avatar

Run the initial preparation and caching pass:

sh inference.sh v1.5 realtime

This performs one-time feature extraction and creates the cache folder:

/results/v15/avatars/avator_1/
├── avator_info.json
├── latents.pt
├── coords.pkl
└── masks/ ...
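Before switching preparation off, it can help to confirm the cache was actually written (a minimal sketch; the path follows the tree above, relative to the repo root, so adjust it if yours differs):

import os

avatar_dir = "results/v15/avatars/avator_1"   # adjust if your output path differs
for name in ["avator_info.json", "latents.pt", "coords.pkl", "masks"]:
    path = os.path.join(avatar_dir, name)
    print("OK  " if os.path.exists(path) else "MISS", path)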

⚡ Step 3 – Switch to fast mode

Once the avatar has been prepared, set:

preparation: False

Then run fast inference:

python -m scripts.realtime_inference --version v15 \
  --inference_config configs/inference/realtime.yaml \
  --skip_save_images

(The valid value for --version is v15, not v1.5; passing v1.5 will throw an “invalid choice” error.)

This command reuses the cached avatar, skips per-frame PNG writes, and benchmarks true inference speed on the GPU.
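To reproduce the timing numbers below yourself, a small wrapper can time the full command end to end (a minimal sketch using only the standard library; the command is the one shown above):

import subprocess
import time

cmd = [
    "python", "-m", "scripts.realtime_inference",
    "--version", "v15",
    "--inference_config", "configs/inference/realtime.yaml",
    "--skip_save_images",
]

start = time.perf_counter()
subprocess.run(cmd, check=True)       # raises if the run fails
print(f"End-to-end wall time: {time.perf_counter() - start:.1f} s")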


📊 Example timing results (RunPod A5000, 59-second audio)

| Mode | Command | preparation | Media Type | Total Time |
| --- | --- | --- | --- | --- |
| First run | sh inference.sh v1.5 realtime | True | image/video | ≈ 90 s |
| Cached run | sh inference.sh v1.5 realtime | False | image/video | ≈ 160 s (includes ffmpeg & disk I/O) |
| Fast mode | python -m scripts.realtime_inference --version v15 --skip_save_images | False | image/video | ≈ 49 s (faster than real time) |

🎬 Outcome

  • preparation: True → builds and caches avatar geometry from image or video
  • preparation: False → reuses cached data for fast inference
  • --skip_save_images → disables PNG/ffmpeg steps for pure model speed

On RunPod A5000, MuseTalk v1.5 reaches near-real-time performance using either static image or short video inputs for video_path.


🧩 11) Monitor GPU Usage (Optional)

watch -n 1 nvidia-smi

✅ Refreshes the GPU status every second, which is perfect for verifying that MuseTalk is using the GPU and for monitoring VRAM.
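If you’d rather log utilization from Python than keep a second terminal open, a simple polling loop over nvidia-smi works too (a minimal sketch; it assumes nvidia-smi is on PATH, as it is on this image):

import subprocess
import time

# Print GPU utilization and memory once per second (Ctrl+C to stop)
query = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader",
]
while True:
    out = subprocess.run(query, capture_output=True, text=True)
    print(out.stdout.strip())   # e.g. "87 %, 9832 MiB, 24564 MiB"
    time.sleep(1)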


🧭 Environment Summary (Working)

| Component | Version / Setting |
| --- | --- |
| GPU | NVIDIA A5000 |
| RunPod Template | sub38-pod |
| Base Image | runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel |
| Python Env | /muse_env (activated via source /muse_env/bin/activate) |
| which python | /muse_env/bin/python (after activation) |
| PyTorch | 2.0.1 (CUDA 11.8) |
| huggingface_hub | 0.25.2 (CLI available) |
| FFmpeg | Static build 7.0.2 on PATH |
| Total Model Size | ~8.7 GB |
| MuseTalk | Working end-to-end ✅ |

✅ Key Takeaways

  • Match PyTorch/CUDA with the base image (CU118 for this setup).
  • Pin huggingface_hub==0.25.2 to retain huggingface-cli.
  • Disable self-upgrades inside helper scripts that could break the CLI.
  • Install FFmpeg manually and export to PATH.
  • Activate the correct Python env (/muse_env) so installs go where you expect.
  • Verify weights (~8.7 GB) before first run.

🧯 Quick Troubleshooting

  • ModuleNotFoundError (mmcv/mmpose)
    Ensure you ran openmim installs after activating your env and with the final PyTorch/CUDA pinned.
  • ffmpeg: command not found
    Re-export PATH (new terminals don’t inherit it). Consider adding it to ~/.bashrc:
    echo "export PATH=\"$FFMPEG_PATH:\$PATH\"" >> ~/.bashrc
  • Slow/partial weight downloads
    Re-run download_weights.sh. For large files, huggingface-cli download supports resuming.
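    If the script keeps stalling, the same pinned huggingface_hub can also fetch a repo programmatically and resume partially downloaded files from its cache (a minimal sketch; the repo id and local_dir below are illustrative, so match them to whatever download_weights.sh actually pulls):

from huggingface_hub import snapshot_download

# Illustrative target; check download_weights.sh for the real repo ids and paths
snapshot_download(
    repo_id="TMElyralab/MuseTalk",
    local_dir="models/MuseTalk_hf",   # illustrative destination
)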

With the above, MuseTalk runs reliably on RunPod A5000 using the PyTorch 2.0.1 + CUDA 11.8 image: no CUDA mismatches, a working Hugging Face CLI, FFmpeg available, and all weights in place.
