A Single Image and Multimodality Is All You Need for Novel View Synthesis

Amirhosein Javadi, Chi-Shiang Gau, Konstantinos D. Polyzos, Tara Javidi

Department of Electrical and Computer Engineering, University of California San Diego

ICLR 2026 Workshop

Teaser Result

Example qualitative comparison (Ground Truth | Vision-Only | Ours) from the single-image novel view synthesis benchmark.

Abstract

This work makes single-image novel view synthesis more reliable by replacing fragile vision-only depth with sparse radar or LiDAR sensing. Even extremely sparse range measurements provide stronger geometry for diffusion-based video generation. The method reconstructs dense depth using localized Gaussian Processes, producing both depth estimates and uncertainty maps. These maps plug directly into existing diffusion pipelines without changing the generative model. On real-world driving scenes, multimodal depth improves visual quality, geometric alignment, and temporal consistency over image-only baselines. Using sparse radar reduces LPIPS by 23.5% and FID by 46.0%, showing that better geometry leads to better novel views.

Method

Figure 1. Multimodal novel view synthesis pipeline overview.

Sparse Range Input & Angular-Domain Projection. Start with one RGB image and sparse radar or LiDAR measurements, then project both pixels and range points into a shared angular domain.
Local Neighborhood Selection & Localized GP Inference. For each viewing direction, select nearby sparse range measurements and use a localized Gaussian Process to estimate depth efficiently.
Dense Depth Map + Uncertainty Mask. Produce a dense depth map along with uncertainty estimates, then mask high-uncertainty regions to remove unreliable geometry.
3D Point Cloud Generation. Back-project the RGB image using the reconstructed depth to form a colored 3D point cloud of the scene.
Render Target Camera Views & Let Diffusion Complete the Scene. Render the point cloud along the target camera trajectory and feed the frames to a diffusion model to synthesize temporally consistent novel views.
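The geometric steps above (angular projection, neighborhood selection, localized GP inference, back-projection) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the pinhole camera model, the RBF kernel, the hyperparameter values, the constant local prior mean, and every function name here are hypothetical choices made for the example.

```python
import numpy as np

def pixel_angles(h, w, fx, fy, cx, cy):
    """Azimuth and elevation (radians) of every pixel ray; pinhole model assumed."""
    u, v = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
    x, y = (u - cx) / fx, (v - cy) / fy           # normalized image coordinates
    az = np.arctan(x)
    el = np.arctan(y / np.sqrt(x**2 + 1.0))
    return np.stack([az.ravel(), el.ravel()], axis=1)

def point_angles(xyz):
    """Azimuth/elevation and range of sensor points given in the camera frame."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    az = np.arctan2(x, z)
    el = np.arctan2(y, np.hypot(x, z))
    return np.stack([az, el], axis=1), np.linalg.norm(xyz, axis=1)

def rbf(a, b, length_scale, signal_var):
    """Squared-exponential kernel between two sets of angular coordinates."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def local_gp_range(query, pts, ranges, k=8,
                   length_scale=0.05, signal_var=100.0, noise_var=0.01):
    """GP posterior mean and variance of range, using only the k nearest points."""
    mean = np.empty(len(query))
    var = np.empty(len(query))
    for i, q in enumerate(query):
        idx = np.argsort(((pts - q) ** 2).sum(1))[:k]    # local neighborhood
        X = pts[idx]
        prior = ranges[idx].mean()                       # assumed constant prior mean
        K = rbf(X, X, length_scale, signal_var) + noise_var * np.eye(len(idx))
        ks = rbf(q[None, :], X, length_scale, signal_var)[0]
        mean[i] = prior + ks @ np.linalg.solve(K, ranges[idx] - prior)
        var[i] = max(signal_var - ks @ np.linalg.solve(K, ks), 0.0)
    return mean, var

def backproject(angles, ranges):
    """Turn (azimuth, elevation, range) back into camera-frame XYZ points."""
    az, el = angles[:, 0], angles[:, 1]
    return ranges[:, None] * np.stack(
        [np.cos(el) * np.sin(az), np.sin(el), np.cos(el) * np.cos(az)], axis=1)
```

Because each query only solves a k-by-k linear system, the cost grows linearly with the number of pixels rather than cubically with the number of range measurements. The posterior variance stays close to `signal_var` for directions with no nearby measurement, which is what makes it usable as an uncertainty map.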

Experiments

Qualitative Results for Single-Image Novel View Synthesis

Across diverse View-of-Delft scenes, replacing monocular depth with sparse range-based reconstruction gives better geometric alignment and visibly fewer rendering artifacts in novel views.

Quantitative Results for Single-Image Novel View Synthesis

On View-of-Delft, replacing vision-only depth with our multimodal reconstruction improves all video-generation metrics for both sparse radar (0.02% coverage) and sparse LiDAR (0.52% coverage). Relative to the vision-only baseline, radar improves PSNR by about 15.4% and SSIM by 6.6%, while reducing LPIPS by 23.5%, FID by 46.0%, and temporal LPIPS by 29.3%; LiDAR further improves performance, reaching the best overall scores across all metrics.
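For readers reproducing the percentages, the sketch below shows the usual way a relative change against a baseline is computed. The helper name and the input values are hypothetical and chosen only to illustrate the arithmetic; they are not numbers from the paper.

```python
def relative_change(baseline, ours):
    """Signed relative change in percent; negative means the metric went down."""
    return 100.0 * (ours - baseline) / abs(baseline)

# Hypothetical values for illustration only (not results from the paper).
# For lower-is-better metrics such as LPIPS or FID, a negative change is an improvement.
example = relative_change(baseline=0.40, ours=0.30)
```

For higher-is-better metrics such as PSNR and SSIM, a positive value indicates an improvement.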

Quantitative Results for Depth Estimation Accuracy

Evaluated against LiDAR depth on valid pixels, our sparse radar depth reconstruction achieves lower MAE and RMSE-log than the monocular baselines.
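A minimal sketch of the two depth metrics, using their standard definitions. It assumes that "valid pixels" means pixels with a positive LiDAR reference depth; the function name and that validity rule are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

def depth_errors(pred, gt):
    """MAE and RMSE-log between predicted and reference depth on valid pixels."""
    valid = (gt > 0) & (pred > 0)        # assumed validity rule: positive depths only
    p, g = pred[valid], gt[valid]
    mae = np.abs(p - g).mean()
    rmse_log = np.sqrt(((np.log(p) - np.log(g)) ** 2).mean())
    return mae, rmse_log
```

MAE is measured in the depth unit (for example meters), while RMSE-log compares depth ratios and is therefore less dominated by far-range pixels.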

Ablation Study: Uncertainty-Based Masking

Masking high-uncertainty depth regions improves conditioning quality; retaining the most certain 80% gives the best trade-off between geometric reliability and coverage.
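Retaining the most certain fraction of pixels amounts to thresholding the uncertainty map at a quantile. The sketch below assumes the uncertainty is a per-pixel variance and marks dropped pixels with NaN; both choices, and the function name, are assumptions for illustration.

```python
import numpy as np

def mask_by_uncertainty(depth, variance, keep_fraction=0.8):
    """Keep the keep_fraction most certain pixels; set the rest to NaN."""
    threshold = np.quantile(variance, keep_fraction)
    keep = variance <= threshold
    return np.where(keep, depth, np.nan), keep
```

A lower `keep_fraction` removes more unreliable geometry but leaves larger holes for the diffusion model to fill, which is the trade-off the ablation examines.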

Conclusion

This work shows that reliable geometry is the key to better single-image novel view synthesis, especially in real-world scenes where monocular depth can fail.
By combining one RGB image with extremely sparse radar or LiDAR measurements, the method produces stronger depth priors without modifying the diffusion model.
The results demonstrate that multimodal sensing improves visual quality, geometric alignment, and temporal consistency, making diffusion-based novel view synthesis more practical for autonomous driving and 3D scene perception.

Cite this work

A. Javadi, C.-S. Gau, K. D. Polyzos, and T. Javidi, "A Single Image and Multimodality Is All You Need for Novel View Synthesis," ICLR 2026 Workshop.

@inproceedings{javadi2026singleimage,
    title={A Single Image and Multimodality Is All You Need for Novel View Synthesis},
    author={Javadi, Amirhosein and Gau, Chi-Shiang and Polyzos, Konstantinos D. and Javidi, Tara},
    booktitle={ICLR 2026 Workshop},
    year={2026},
}