A Single Image and Multimodality Is All You Need for Novel View Synthesis

Department of Electrical and Computer Engineering, University of California San Diego
ICLR 2026 Workshop

Teaser Result

Example qualitative comparison (Ground Truth | Vision-Only | Ours) from the single-image novel view synthesis benchmark.

Teaser qualitative result 00742

Abstract

This work makes single-image novel view synthesis more reliable by replacing fragile vision-only depth with sparse radar or LiDAR sensing. Even extremely sparse range measurements provide stronger geometry for diffusion-based video generation. The method reconstructs dense depth using localized Gaussian Processes, producing both depth estimates and uncertainty maps. These maps plug directly into existing diffusion pipelines without changing the generative model. On real-world driving scenes, multimodal depth improves visual quality, geometric alignment, and temporal consistency over image-only baselines. Using sparse radar reduces LPIPS by 23.5% and FID by 46.0%, showing that better geometry leads to better novel views.


Method

Multimodal novel view synthesis pipeline overview

Figure 1. Multimodal novel view synthesis pipeline overview.

  • Sparse Range Input & Angular-Domain Projection. Start with one RGB image and sparse radar or LiDAR measurements, then project both pixels and range points into a shared angular domain.
  • Local Neighborhood Selection & Localized GP Inference. For each viewing direction, select nearby sparse range measurements and use a localized Gaussian Process to estimate depth efficiently.
  • Dense Depth Map + Uncertainty Mask. Produce a dense depth map along with uncertainty estimates, then mask high-uncertainty regions to remove unreliable geometry.
  • 3D Point Cloud Generation. Back-project the RGB image using the reconstructed depth to form a colored 3D point cloud of the scene.
  • Render Target Camera Views & Let Diffusion Complete the Scene. Render the point cloud along the target camera trajectory and feed the frames to a diffusion model to synthesize temporally consistent novel views.
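The first four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the squared-exponential kernel, the k-nearest-neighbour selection rule, the constant prior mean, and every hyperparameter value are hypothetical choices made for readability. The sketch also treats the reconstructed depth as range along each pixel ray.

```python
import numpy as np


def pixel_angles(width, height, fx, fy, cx, cy):
    """Azimuth and elevation (radians) of every pixel ray; pinhole model assumed."""
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    x = (u - cx) / fx  # normalized image coordinates (x right, y down)
    y = (v - cy) / fy
    azimuth = np.arctan2(x, 1.0)
    elevation = np.arctan2(-y, np.sqrt(x**2 + 1.0))
    return np.stack([azimuth.ravel(), elevation.ravel()], axis=1)  # (H*W, 2)


def rbf_kernel(a, b, length_scale, signal_var):
    """Squared-exponential kernel between two sets of angular coordinates."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)


def local_gp_range(queries, points, ranges, k=8,
                   length_scale=0.05, signal_var=25.0, noise_var=1e-4):
    """Posterior mean and variance of range at each query direction, using
    only the k nearest sparse measurements (assumed neighbourhood rule)."""
    prior_mean = ranges.mean()  # constant prior mean (assumption)
    mean = np.empty(len(queries))
    var = np.empty(len(queries))
    for i, q in enumerate(queries):
        nearest = np.argsort(((points - q) ** 2).sum(axis=1))[:k]
        X, y = points[nearest], ranges[nearest] - prior_mean
        K = rbf_kernel(X, X, length_scale, signal_var) + noise_var * np.eye(len(nearest))
        k_star = rbf_kernel(q[None, :], X, length_scale, signal_var)[0]
        mean[i] = prior_mean + k_star @ np.linalg.solve(K, y)
        var[i] = max(signal_var - k_star @ np.linalg.solve(K, k_star), 0.0)
    return mean, var


def back_project(angles, ranges):
    """3D points (x right, y down, z forward) at the given range along each ray."""
    az, el = angles[:, 0], angles[:, 1]
    directions = np.stack(
        [np.cos(el) * np.sin(az), -np.sin(el), np.cos(el) * np.cos(az)], axis=1
    )
    return ranges[:, None] * directions


# Toy usage: four sparse returns and a 5x3 image with made-up intrinsics.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0]])
ranges = np.array([10.0, 12.0, 11.0, 9.0])
angles = pixel_angles(5, 3, fx=4.0, fy=4.0, cx=2.0, cy=1.0)
depth, variance = local_gp_range(angles, points, ranges, k=3)
cloud = back_project(angles, depth)  # (15, 3); colors would come from the RGB image
```

Restricting each query to its k nearest measurements keeps every linear solve at size k by k, which is what makes per-pixel GP inference affordable in this sketch. The posterior variance returns to the prior value far from any measurement, so it can serve directly as the uncertainty map.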

Experiments

Qualitative Results for Single-Image Novel View Synthesis

Across diverse View-of-Delft scenes, replacing monocular depth with sparse range-based reconstruction gives better geometric alignment and visibly fewer rendering artifacts in novel views.

Qualitative results are shown for scenes 00742, 01715, 01743, 02451, 07035, 07088, 08350, and 08588.

Quantitative Results for Single-Image Novel View Synthesis

On View-of-Delft, replacing vision-only depth with our multimodal reconstruction improves every video-generation metric for both sparse radar (0.02% coverage) and sparse LiDAR (0.52% coverage). Relative to the vision-only baseline, radar improves PSNR by about 15.4% and SSIM by 6.6%, and reduces LPIPS by 23.5%, FID by 46.0%, and temporal LPIPS by 29.3%. LiDAR improves performance further and achieves the best score on every metric.
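The percentages above are relative changes with respect to the vision-only baseline. As a reminder of the arithmetic, here is a small sketch; the helper name `relative_change` and the numbers passed to it are illustrative only and are not values from the paper.

```python
def relative_change(baseline, ours):
    """Percent change of a metric relative to the baseline value."""
    return 100.0 * (ours - baseline) / baseline


# Made-up values, chosen only to show the sign convention.
higher_is_better = relative_change(baseline=10.0, ours=12.0)   # positive = improvement
lower_is_better = relative_change(baseline=0.50, ours=0.25)    # negative = reduction
```

For metrics where higher is better (PSNR, SSIM) an improvement shows up as a positive change, while for metrics where lower is better (LPIPS, FID, temporal LPIPS) it shows up as a negative change, reported in the text as a reduction.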

NVS quantitative metrics

Quantitative Results for Depth Estimation Accuracy

Evaluated against LiDAR depth on valid pixels, our sparse-radar depth reconstruction achieves lower MAE and RMSE-log than the monocular baselines.
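For readers unfamiliar with these two error measures, the sketch below shows one common way to compute them. It assumes that pixels without a LiDAR return are stored as 0 and that RMSE-log uses the natural logarithm; the paper's exact validity mask and conventions may differ, and the function name `depth_errors` is hypothetical.

```python
import numpy as np


def depth_errors(pred, lidar):
    """MAE and RMSE-log over pixels where a LiDAR reference exists."""
    valid = (lidar > 0) & (pred > 0)  # assumption: 0 marks "no return"
    p, g = pred[valid], lidar[valid]
    mae = float(np.abs(p - g).mean())
    rmse_log = float(np.sqrt(((np.log(p) - np.log(g)) ** 2).mean()))
    return mae, rmse_log


# Toy 2x2 depth maps; one pixel has no LiDAR return and is ignored.
pred = np.array([[2.0, 4.0], [5.0, 1.0]])
lidar = np.array([[2.0, 2.0], [0.0, 1.0]])
mae, rmse_log = depth_errors(pred, lidar)
```

MAE weights all absolute errors equally, whereas RMSE-log measures relative error, so a fixed absolute error counts for more at short range than at long range.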

Depth estimation quantitative metrics

Ablation Study: Uncertainty-Based Masking

Masking high-uncertainty depth regions improves conditioning quality; retaining the most certain 80% of depth estimates gives the best trade-off between geometric reliability and coverage.
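A quantile threshold is one simple way to keep a fixed fraction of the most certain pixels. The sketch below assumes the uncertainty is a per-pixel variance map, that the threshold is chosen per image, and that masked pixels are marked with NaN; these conventions and the function name `mask_by_uncertainty` are assumptions, not details taken from the paper.

```python
import numpy as np


def mask_by_uncertainty(depth, variance, keep_fraction=0.8):
    """Keep the `keep_fraction` most certain pixels; mark the rest as NaN."""
    threshold = np.quantile(variance, keep_fraction)
    keep = variance <= threshold
    return np.where(keep, depth, np.nan), keep


# Toy 2x5 maps: variances 0..9, so the two most uncertain pixels are dropped.
depth = np.ones((2, 5))
variance = np.arange(10, dtype=float).reshape(2, 5)
masked_depth, keep = mask_by_uncertainty(depth, variance, keep_fraction=0.8)
```

Raising `keep_fraction` trades geometric reliability for coverage: more pixels reach the point cloud, but more of them carry uncertain depth.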

Ablation uncertainty masking

Conclusion


Cite this work

A. Javadi, C.-S. Gau, K. D. Polyzos, and T. Javidi, "A Single Image and Multimodality Is All You Need for Novel View Synthesis," ICLR 2026 Workshop, 2026.

@inproceedings{javadi2026singleimage,
    title={A Single Image and Multimodality Is All You Need for Novel View Synthesis},
    author={Javadi, Amirhosein and Gau, Chi-Shiang and Polyzos, Konstantinos D. and Javidi, Tara},
    booktitle={ICLR 2026 Workshop},
    year={2026},
}