We propose a novel 3D-aware diffusion-based method for generating photorealistic talking head videos directly from a single identity image and explicit control signals (e.g., expressions).
Our method generates Multiplane Images (MPIs), which ensure geometric consistency and make the results well suited for immersive viewing experiences such as binocular videos on VR headsets. Unlike existing methods, which often require a separate stage or joint optimization to reconstruct a 3D representation (such as NeRF or 3D Gaussians), our approach produces the final output directly through a single denoising process, eliminating post-processing steps and enabling efficient novel-view rendering.
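To make the efficiency point concrete, the sketch below shows how a generated MPI can be rendered from a novel camera by intersecting target-view rays with each fronto-parallel RGBA plane and alpha-compositing the layers front to back. This is a generic PyTorch illustration of MPI rendering, not the paper's implementation; the function name, pose convention (R, t mapping target-camera to source-camera coordinates), and tensor layouts are our assumptions.

```python
import torch
import torch.nn.functional as F

def render_mpi(rgba_planes, depths, K_src, K_tgt, R, t, out_hw):
    """rgba_planes: (D, 4, H, W) RGBA layers ordered near to far.
    depths: (D,) plane depths along the source camera's z-axis.
    K_src, K_tgt: (3, 3) intrinsics.
    R, t: pose mapping target-camera coords to source-camera coords (X_s = R X_t + t).
    Returns a (3, H', W') image rendered from the target camera."""
    D, _, H_src, W_src = rgba_planes.shape
    H_out, W_out = out_hw
    # Target-view pixel grid in homogeneous coordinates.
    ys, xs = torch.meshgrid(torch.arange(H_out, dtype=torch.float32),
                            torch.arange(W_out, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # (H', W', 3)
    rays = pix @ torch.linalg.inv(K_tgt).T                     # ray directions in target camera
    rays_src = rays @ R.T                                      # same rays rotated into source frame

    color = torch.zeros(3, H_out, W_out)
    trans = torch.ones(1, H_out, W_out)                        # accumulated transmittance
    for d in range(D):                                         # front-to-back "over" compositing
        # Intersect each target ray with the fronto-parallel plane z = depths[d].
        lam = (depths[d] - t[2]) / rays_src[..., 2]
        X_src = lam[..., None] * rays_src + t                  # 3D intersection points (source frame)
        p_src = X_src @ K_src.T
        uv = p_src[..., :2] / p_src[..., 2:3]                  # corresponding source-plane pixels
        grid = torch.stack([2 * uv[..., 0] / (W_src - 1) - 1,
                            2 * uv[..., 1] / (H_src - 1) - 1], dim=-1)
        layer = F.grid_sample(rgba_planes[d:d + 1], grid[None], align_corners=True)[0]
        rgb, alpha = layer[:3], layer[3:4]
        color = color + trans * alpha * rgb
        trans = trans * (1.0 - alpha)
    return color
```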
To effectively learn from monocular videos, we introduce a training mechanism that reconstructs the output MPI randomly in either the target or the reference camera space. This approach enables the model to simultaneously learn sharp image details and underlying 3D information. Extensive experiments demonstrate the effectiveness of our method, which achieves competitive avatar quality and novel-view rendering capabilities, even without explicit 3D reconstruction or high-quality multi-view training data.
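A minimal sketch of how this random choice could look inside one training step, assuming an L1 reconstruction loss and placeholder callables (`denoise_fn`, `render_fn`); the exact supervision in the paper may differ, and the comments reflect our reading of why each case helps.

```python
import random
import torch.nn.functional as F

def training_step(denoise_fn, render_fn, ref_frame, tgt_frame,
                  ref_cam, tgt_cam, expressions):
    """One hypothetical training iteration (names are placeholders).
    denoise_fn(ref_frame, expressions, camera) -> predicted MPI layers
    render_fn(mpi, from_cam, to_cam) -> image rendered in `to_cam`."""
    # Randomly pick which camera space the predicted MPI lives in.
    mpi_cam = tgt_cam if random.random() < 0.5 else ref_cam
    mpi = denoise_fn(ref_frame, expressions, mpi_cam)

    # Supervise the render in the target view. When the MPI sits in the target
    # camera space this is close to a direct reconstruction (sharp detail);
    # when it sits in the reference camera space the render involves a real
    # viewpoint change, which encourages consistent underlying 3D structure.
    rendered = render_fn(mpi, mpi_cam, tgt_cam)
    return F.l1_loss(rendered, tgt_frame)
```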
Our model is built on the architecture of Lumiere, which takes an identity image, a 2D noise video, a sequence of expressions rendered from a 3DMM, and a first-frame image as input, and outputs MPI video sequences. During inference, the network is conditioned on a reference portrait and takes the last frame of the previously generated clip as the first-frame condition. We separate the network into a color branch and a geometry branch.
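The autoregressive, clip-by-clip inference described above might be organized as in the sketch below. The sampler interface, clip length, and noise shape are assumptions; only the conditioning pattern (reference portrait plus the last frame of the previous clip) follows the description.

```python
import torch

@torch.no_grad()
def generate_long_video(sample_clip_fn, reference, expression_frames,
                        noise_shape, clip_len=16):
    """sample_clip_fn(noise, reference, first_frame, expressions) -> MPI clip
    (a sequence of MPI frames). `expression_frames` holds the 3DMM expression
    renderings for the whole video; names and shapes are placeholders."""
    mpi_video = []
    first_frame = reference                      # the first clip is conditioned on the reference
    for start in range(0, len(expression_frames), clip_len):
        noise = torch.randn(noise_shape)         # 2D noise video for this clip
        exprs = expression_frames[start:start + clip_len]
        clip = sample_clip_fn(noise, reference, first_frame, exprs)  # denoise -> MPI clip
        mpi_video.append(clip)
        first_frame = clip[-1]                   # last generated frame conditions the next clip
    return mpi_video
```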
We compare our method with existing one-shot photorealistic talking-head approaches trained on in-the-wild monocular videos: Face-V2V, EMOPortrait, Portrait4D-v2, X-Portrait, and Follow-Your-Emoji, in a self-reenactment setting.
Here we show side-view renderings. All videos are generated in the same self-reenactment setting, where the first frame serves as the reference portrait and the remaining frames provide the driving signals. Since we use MPIs as our scene representation, the generated talking-head videos do not support rendering under large camera viewpoint changes; however, our method can still produce reasonable stereo videos for viewing purposes.
We can further set two fixed side-view cameras and render the generated MPI videos from -5° and +5°.
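A minimal sketch of this setup, assuming the two cameras orbit a pivot roughly at the subject's depth and reusing the hypothetical `render_mpi` helper from the earlier sketch; the pivot depth and intrinsics here are purely illustrative.

```python
import math
import torch

def orbit_pose(deg, pivot_depth=1.0):
    """Pose (R, t) of a camera orbited by `deg` about a vertical axis through a
    pivot at `pivot_depth` in front of the source camera, expressed as the
    target-to-source transform expected by the render_mpi sketch above."""
    a = math.radians(deg)
    R = torch.tensor([[ math.cos(a), 0.0, math.sin(a)],
                      [ 0.0,         1.0, 0.0        ],
                      [-math.sin(a), 0.0, math.cos(a)]])
    pivot = torch.tensor([0.0, 0.0, pivot_depth])
    t = pivot - R @ pivot          # orbit around the pivot, not the camera center
    return R, t

def render_stereo_video(mpi_frames, depths, K, out_hw):
    """Render each generated MPI frame from the two fixed side-view cameras."""
    left_pose, right_pose = orbit_pose(-5.0), orbit_pose(+5.0)
    stereo = []
    for rgba_planes in mpi_frames:                 # one MPI per generated frame
        left = render_mpi(rgba_planes, depths, K, K, *left_pose, out_hw)
        right = render_mpi(rgba_planes, depths, K, K, *right_pose, out_hw)
        stereo.append((left, right))
    return stereo
```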
Although trained on real-world talking-head videos, our model still generalizes to stylized portraits generated by Stable Diffusion.
We show the long videos generated by our model.
Here we show stereo videos rendered from the generated MPI videos. We view all stereo videos with Google Cardboard. Specifically, we tested them on a 6.1-inch iPhone 15 Pro in the photo album's preview mode, slideshow, and fullscreen playback, and on a 6.7-inch Google Pixel 7 Pro using fullscreen playback.
@article{li2025IMPortrait,
author = {Li, Yuan and Bai, Ziqian and Tan, Feitong and Cui, Zhaopeng and Fanello, Sean and Zhang, Yinda},
title = {IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos},
journal = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2025},
}