MuseTalk is a real-time talking-face system that combines pose estimation, voice analysis, and lip-sync. Getting it running smoothly on a cloud GPU like RunPod (NVIDIA A5000) can take a few tweaks. Below is the exact process that worked for me, plus the fixes for the common snags I hit (PyTorch/CUDA mismatch, missing huggingface-cli, FFmpeg path issues, and incomplete weight downloads).
🖥️ 1) RunPod Environment
- GPU: NVIDIA A5000
- Template: sub38-pod
- Base Image: runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel
Clone the repo:
git clone https://github.com/TMElyralab/MuseTalk.git
cd MuseTalk
📦 2) (If present) Activate the prebuilt virtualenv
On this image, there’s a precreated environment at /muse_env. Activate it before installing anything so tools land in the right place:
source /muse_env/bin/activate
which python
# e.g. /muse_env/bin/python
If /muse_env doesn’t exist in your template, create your own venv instead and use that consistently.
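For example, a minimal fallback (a sketch assuming python3 and the venv module are present on the image):
python3 -m venv /muse_env
source /muse_env/bin/activate
python -m pip install --upgrade pip   # make sure pip itself is current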
⚙️ 3) Install Python dependencies
pip install -r requirements.txt
This generally succeeds, but you still need to confirm (next step) that the PyTorch and CUDA builds actually match.
🧠 4) Fix PyTorch ↔ CUDA compatibility (cu118)
The base image ships with PyTorch 2.0.1 (CU118). Verify:
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Expect something like: 2.0.1 11.8
If it’s not CUDA 11.8 (e.g., you accidentally pulled CU12 wheels), pin the correct builds:
pip uninstall -y torch torchvision torchaudio
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2+cu118 \
-f https://download.pytorch.org/whl/torch_stable.html
✅ Result: PyTorch and CUDA align with the image (11.8).
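As an extra sanity check, confirm the GPU is actually visible to PyTorch before moving on:
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# should print True followed by the A5000's device name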
🧩 5) Install OpenMMLab components (MMPose)
MuseTalk relies on MMPose. Use mim:
pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv==2.0.1"
mim install "mmdet==3.1.0"
mim install "mmpose==1.1.0"
✅ Result: MMCV & MMPose installed and compiled against your current stack.
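A quick import check confirms the pins took effect (versions shown are what the commands above should produce):
python -c "import mmcv, mmpose; print(mmcv.__version__, mmpose.__version__)"
# e.g. 2.0.1 1.1.0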
🎬 6) Add FFmpeg (static build)
The base image usually lacks FFmpeg. Fetch a static build and expose it on PATH:
wget https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar -xvf ffmpeg-release-amd64-static.tar.xz
export FFMPEG_PATH="$PWD/ffmpeg-7.0.2-amd64-static"
export PATH="$FFMPEG_PATH:$PATH"
ffmpeg -version
# should print version info
(If your extracted folder name differs, update FFMPEG_PATH accordingly.)
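To avoid hard-coding the version, you can resolve the folder name at runtime instead; a small sketch:
export FFMPEG_PATH="$(find "$PWD" -maxdepth 1 -type d -name 'ffmpeg-*-amd64-static' | head -n 1)"
export PATH="$FFMPEG_PATH:$PATH"
ffmpeg -version   # confirm it resolves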
💾 7) Download the model weights
MuseTalk ships a helper script:
bash ./download_weights.sh
If you see: huggingface-cli: command not found
Root cause: Newer huggingface_hub (>=1.0) removed the legacy CLI entry point.
Fix: install a CLI-providing version and keep it pinned:
pip install "huggingface_hub[cli]==0.25.2" --force-reinstall
which huggingface-cli
# e.g. /muse_env/bin/huggingface-cli
Prevent auto-upgrades inside the script: open download_weights.sh and comment out the self-update line:
# pip install -U "huggingface_hub[cli]" # keep pinned CLI working
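If you’d rather not edit the file by hand, a sed one-liner can comment that line out (this assumes the self-update line matches the pattern below; check the script first):
sed -i 's/^pip install -U "huggingface_hub\[cli\]"/# &/' download_weights.sh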
Re-run:
bash ./download_weights.sh
🔍 8) Verify weights
du -sh models
# e.g. ~8.7G models
Typical artifacts include:
- musetalk/*
- musetalkV15/unet.pth
- sd-vae/diffusion_pytorch_model.bin
- whisper/pytorch_model.bin
- dwpose/dw-ll_ucoco_384.pth
- syncnet/latentsync_syncnet.pt
✅ Result: All weights present (≈8.7 GB total).
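A scripted presence check for the key files (paths taken from the list above; adjust if your layout differs):
for f in musetalkV15/unet.pth sd-vae/diffusion_pytorch_model.bin \
         whisper/pytorch_model.bin dwpose/dw-ll_ucoco_384.pth \
         syncnet/latentsync_syncnet.pt; do
  [ -f "models/$f" ] && echo "OK       $f" || echo "MISSING  $f"
done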
▶️ 9) Run MuseTalk
python app.py
You can also run a one-shot inference:
bash inference.sh
# or:
python app.py --input_video sample.mp4 --input_audio sample.wav --output output.mp4
✅ Result: Generates a realistic talking-face video from your inputs.
🧩 10) Running the Real-Time Pipeline (MuseTalk v1.5)
With MuseTalk installed and verified, we moved from the one-shot demo (python app.py) to the real-time inference pipeline, which is designed for faster lip-sync generation and suits live or interactive AI avatar applications.
🧠 Step 1 – Configure your avatar
Open configs/inference/realtime.yaml and set up your avatar block:
avator_1:
  preparation: True        # run once to build cached features
  bbox_shift: 5
  video_path: "data/video/1-prisha.mp4"   # or "data/video/1-prisha.png"
  audio_clips:
    audio_0: "data/audio/sumi.wav"
🔸 Note:
The video_path can point to either:
- a short video file (.mp4, .mov, etc.), which captures natural head pose and lighting and often gives more stable results, or
- a single image (.png, .jpg), which is faster to prepare but produces a static-face style output.
Both work. MuseTalk automatically extracts frames and face regions based on what you provide.
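If your source footage is long, a short clip is plenty for avatar preparation. For example, cutting the first 10 seconds with the static FFmpeg from step 6 (input filename assumed):
ffmpeg -i source.mp4 -t 10 data/video/1-prisha.mp4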
⚙️ Step 2 – Prepare the avatar
Run the initial preparation and caching pass:
sh inference.sh v1.5 realtime
This performs one-time feature extraction and creates the cache folder:
/results/v15/avatars/avator_1/
├── avator_info.json
├── latents.pt
├── coords.pkl
└── masks/ ...
⚡ Step 3 – Switch to fast mode
Once the avatar has been prepared, set:
preparation: False
Then run fast inference:
python -m scripts.realtime_inference --version v15 \
--inference_config configs/inference/realtime.yaml \
--skip_save_images
(Note: the valid value for --version is v15, not v1.5; passing v1.5 throws an “invalid choice” error.)
This command reuses the cached avatar, skips per-frame PNG writes, and benchmarks true inference speed on the GPU.
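To reproduce the timings below, you can simply wrap the command with time:
time python -m scripts.realtime_inference --version v15 \
    --inference_config configs/inference/realtime.yaml \
    --skip_save_images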
📊 Example timing results (RunPod A5000, 59-second audio)
| Mode | Command | preparation | Media Type | Total Time |
|---|---|---|---|---|
| First run | sh inference.sh v1.5 realtime | True | image/video | ≈ 90 s |
| Cached run | sh inference.sh v1.5 realtime | False | image/video | ≈ 160 s (includes ffmpeg & disk I/O) |
| Fast mode | python -m scripts.realtime_inference --version v15 --skip_save_images | False | image/video | ≈ 49 s (faster-than-real-time) |
🎬 Outcome
- preparation: True → builds and caches avatar geometry from the image or video
- preparation: False → reuses cached data for fast inference
- --skip_save_images → disables the PNG/ffmpeg steps for pure model speed
On RunPod A5000, MuseTalk v1.5 reaches near-real-time performance using either a static image or a short video as video_path.
🧩 11) Monitor GPU Usage (Optional)
watch -n 1 nvidia-smi
✅ Refreshes GPU status every second, which is perfect for verifying that MuseTalk is using the GPU and for monitoring VRAM.
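If you prefer a compact, loggable alternative, nvidia-smi’s query mode prints one CSV line per second:
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1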
🧭 Environment Summary (Working)
| Component | Version / Setting |
|---|---|
| GPU | NVIDIA A5000 |
| RunPod Template | sub38-pod |
| Base Image | runpod/pytorch:2.0.1-py3.10-cuda11.8.0-devel |
| Python Env | /muse_env (activated via source /muse_env/bin/activate) |
| which python | /muse_env/bin/python (after activation) |
| PyTorch | 2.0.1 (CUDA 11.8) |
| huggingface_hub | 0.25.2 (CLI available) |
| FFmpeg | Static build 7.0.2 on PATH |
| Total Model Size | ~8.7 GB |
| MuseTalk | Working end-to-end ✅ |
✅ Key Takeaways
- Match PyTorch/CUDA with the base image (CU118 for this setup).
- Pin huggingface_hub==0.25.2 to retain huggingface-cli.
- Disable self-upgrades inside helper scripts that could break the CLI.
- Install FFmpeg manually and export it to PATH.
- Activate the correct Python env (/muse_env) so installs go where you expect.
- Verify weights (~8.7 GB) before first run.
🧯 Quick Troubleshooting
- ModuleNotFoundError (mmcv/mmpose)
Ensure you ran the openmim installs after activating your env, with the final PyTorch/CUDA builds pinned.
- ffmpeg: command not found
Re-export PATH (new terminals don’t inherit it). Consider adding it to ~/.bashrc:
echo 'export PATH="'"$FFMPEG_PATH"':$PATH"' >> ~/.bashrc
- Slow/partial weight downloads
Re-run download_weights.sh. For large files, huggingface-cli download supports resuming.
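For example, a manual resumable pull of a single repo (repo id assumed for illustration; download_weights.sh lists the exact sources):
huggingface-cli download TMElyralab/MuseTalk --local-dir ./models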
With the above, MuseTalk runs reliably on RunPod A5000 using the PyTorch 2.0.1 + CUDA 11.8 image: no CUDA mismatches, a working Hugging Face CLI, FFmpeg on PATH, and all weights in place.
