Sat2City

3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion

Tongyan Hua1     Lutao Jiang1     Ying-Cong Chen1,2     Wufan Zhao1    

ICCV 2025

1HKUST(GZ)     2HKUST    

We propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset.

Abstract

Recent advancements in generative models have enabled 3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and beyond. However, most existing methods rely heavily on neural rendering techniques, which hinder their ability to produce detailed 3D structures on a broader scale, largely due to the inherent structural ambiguity derived from relatively limited 2D observations. To address this challenge, we propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset. Our approach is enabled by three key components:
  • a cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery,
  • a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck that computes multi-scale feature grids for stable appearance optimization, and
  • an inverse sampling strategy that enables implicit supervision for smooth appearance transitioning.
To overcome the challenge of collecting real-world city-scale 3D models with high-quality geometry and appearance, we introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height maps.
Validated on this dataset, our framework generates detailed 3D structures from a single satellite image, achieving superior fidelity compared to existing city generation models.

Architecture

We adopt XCube’s sparse voxel grid as a unified neural representation for joint geometry and appearance encoding.
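As a rough illustration (not the actual XCube/fVDB API), such a grid can be viewed as a set of occupied integer coordinates, each carrying a feature vector whose channels jointly hold geometry and appearance. The class name and channel layout below are illustrative assumptions only.

    # Minimal sketch of a sparse voxel grid used as a unified geometry+appearance
    # container. This is NOT the XCube/fVDB data structure, just a toy stand-in.
    import torch

    class SparseVoxelGrid:
        def __init__(self, coords: torch.Tensor, features: torch.Tensor, voxel_size: float):
            # coords:   (N, 3) integer indices of occupied voxels
            # features: (N, C) per-voxel features, e.g. geometry channels
            #           (occupancy / normals) concatenated with RGB appearance
            assert coords.shape[0] == features.shape[0]
            self.coords = coords
            self.features = features
            self.voxel_size = voxel_size

        def world_centers(self) -> torch.Tensor:
            # Convert integer indices to metric voxel-center positions.
            return (self.coords.float() + 0.5) * self.voxel_size

    # Toy usage: 4 occupied voxels with 3 geometry + 3 color channels each.
    coords = torch.tensor([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1]])
    feats = torch.randn(4, 6)
    grid = SparseVoxelGrid(coords, feats, voxel_size=0.5)
    print(grid.world_centers())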

Key Components

1. Re-Hash

To support multi-scale appearance modeling, we build coarser voxel grids by re-assigning values from finer ones. Each coarser voxel inherits interpolated features from its neighboring finer-level vertices, ensuring smooth transitions across scales.
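A minimal sketch of this coarsening idea is given below, assuming a simple average over the fine-level voxels that fall into each coarse cell; the actual interpolation weights and any hashing details used in Sat2City may differ.

    # Hedged sketch of a Re-Hash-style coarsening step: each coarse voxel pools
    # (here, averages) the features of the finer-level voxels that map into it.
    import torch

    def rehash_coarser(fine_coords: torch.Tensor, fine_feats: torch.Tensor, factor: int = 2):
        # fine_coords: (N, 3) integer indices at the fine level
        # fine_feats:  (N, C) per-voxel features at the fine level
        coarse_coords = torch.div(fine_coords, factor, rounding_mode="floor")
        # Group fine voxels that share the same coarse index.
        uniq, inverse = torch.unique(coarse_coords, dim=0, return_inverse=True)
        coarse_feats = torch.zeros(uniq.shape[0], fine_feats.shape[1])
        counts = torch.zeros(uniq.shape[0], 1)
        coarse_feats.index_add_(0, inverse, fine_feats)
        counts.index_add_(0, inverse, torch.ones(fine_feats.shape[0], 1))
        return uniq, coarse_feats / counts  # averaged features per coarse voxel

    # Toy usage: 4 fine voxels collapse into 2 coarse voxels.
    fc = torch.tensor([[0, 0, 0], [1, 1, 0], [2, 0, 0], [3, 1, 1]])
    ff = torch.randn(4, 8)
    cc, cf = rehash_coarser(fc, ff)
    print(cc.shape, cf.shape)  # torch.Size([2, 3]) torch.Size([2, 8])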

2. Inverse Sampling

Inverse Sampling computes each point’s color by applying trilinear interpolation over the surrounding vertex colors in the voxel grid. During training, the predicted color at each point is compared with the ground-truth color, and the loss is backpropagated to the contributing vertex features via the interpolation weights.
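The sketch below illustrates this on a small dense grid of learnable vertex colors, assuming standard trilinear weights and an L2 photometric loss; variable names and grid shapes are illustrative rather than the paper's implementation.

    # Hedged sketch of inverse sampling: each query point's color is a trilinear
    # blend of the 8 surrounding vertex colors, and the loss on predicted colors
    # back-propagates to those vertices through the interpolation weights.
    import torch

    def trilinear_sample(vertex_colors: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # vertex_colors: (R, R, R, 3) learnable colors at grid vertices
        # points:        (P, 3) query coordinates in [0, 1]^3
        R = vertex_colors.shape[0]
        p = points * (R - 1)                   # continuous grid coordinates
        lo = p.floor().long().clamp(0, R - 2)  # lower corner index per axis
        frac = p - lo.float()                  # fractional offset in [0, 1]
        out = torch.zeros(points.shape[0], 3)
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    w = ((frac[:, 0] if dx else 1 - frac[:, 0])
                         * (frac[:, 1] if dy else 1 - frac[:, 1])
                         * (frac[:, 2] if dz else 1 - frac[:, 2]))
                    c = vertex_colors[lo[:, 0] + dx, lo[:, 1] + dy, lo[:, 2] + dz]
                    out = out + w.unsqueeze(-1) * c
        return out

    # Toy training step: gradients flow only to the contributing vertices.
    vertex_colors = torch.rand(8, 8, 8, 3, requires_grad=True)
    points, gt_colors = torch.rand(128, 3), torch.rand(128, 3)
    loss = torch.nn.functional.mse_loss(trilinear_sample(vertex_colors, points), gt_colors)
    loss.backward()
    print(vertex_colors.grad.abs().sum())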

Sat2City Dataset

The raw mesh model is created by artists in Blender.

We sample a dense point cloud from the mesh surfaces and simulate a height map by mapping the y-axis of the surface texture coordinates to grayscale values.

The height-field images are cropped into 300×300-pixel segments along with the aligned point clouds. To better align with real-world cases, noise is introduced by applying contrast scaling to the rendered height maps.
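The snippet below sketches one plausible way to rasterize such a height map from a point cloud and perturb it with contrast scaling, under the assumption that y is the vertical axis; the dataset's actual Blender/texture-coordinate pipeline may differ.

    # Hedged sketch of the height-map simulation: points are rasterized onto a
    # top-down 300x300 grid, the vertical (y) coordinate is kept per pixel and
    # normalized to grayscale, and a simple contrast scaling perturbs the result.
    import numpy as np

    def height_map_from_points(points: np.ndarray, res: int = 300, contrast: float = 1.2) -> np.ndarray:
        # points: (N, 3) array with y as the vertical axis
        x, y, z = points[:, 0], points[:, 1], points[:, 2]
        # Map x/z to pixel indices of a res x res grid.
        u = ((x - x.min()) / (np.ptp(x) + 1e-8) * (res - 1)).astype(int)
        v = ((z - z.min()) / (np.ptp(z) + 1e-8) * (res - 1)).astype(int)
        hmap = np.zeros((res, res))
        # Keep the highest point per pixel (roof height seen from the satellite view).
        np.maximum.at(hmap, (v, u), y)
        hmap = (hmap - hmap.min()) / (hmap.max() - hmap.min() + 1e-8)  # grayscale in [0, 1]
        # Contrast scaling as a simple noise/perturbation model.
        return np.clip((hmap - 0.5) * contrast + 0.5, 0.0, 1.0)

    # Toy usage with a random point cloud.
    pts = np.random.rand(10000, 3) * np.array([100.0, 30.0, 100.0])
    print(height_map_from_points(pts).shape)  # (300, 300)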

Comparison

Compared to city generation methods that rely on volume rendering, our framework produces more consistent 3D geometry and texture at a similar scale.

More Results

BibTeX

@inproceedings{hua2025sat2city,
  title     = {Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion},
  author    = {Hua, Tongyan and Jiang, Lutao and Chen, Ying-Cong and Zhao, Wufan},
  booktitle = {ICCV},
  year      = {2025}
}

AI4City Lab, HKUST(GZ)