Recent advancements in generative models have enabled
3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and
beyond. However, most existing methods rely heavily on
neural rendering techniques, which limits their ability to
produce detailed 3D structures at larger scales, largely
due to the structural ambiguity inherent in relatively limited 2D observations.
To address this challenge,
we propose Sat2City, a novel framework that combines
the representational capacity of sparse voxel grids with latent diffusion models, tailored to our new 3D
city dataset. Our approach is built on three key components:
- a cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery;
- a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck that computes multi-scale feature grids for stable appearance optimization (sketched below);
- an inverse sampling strategy that enables implicit supervision for smooth appearance transitions.
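The Re-Hash operation is not detailed here; purely as an illustration, the sketch below shows one plausible way multi-scale feature grids could be computed from sparse voxel coordinates at the bottleneck, using an Instant-NGP-style spatial hash. The functions `hash_coords` and `rehash_multiscale`, the hashing scheme, and all sizes are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only: re-indexing sparse voxel coordinates into learned
# feature tables at several resolutions, then concatenating the per-voxel features.
# The spatial hash and all names/sizes are assumptions, not Sat2City's implementation.
import torch

PRIMES = torch.tensor([1, 2654435761, 805459861])  # spatial-hash primes (assumed)

def hash_coords(coords: torch.Tensor, table_size: int) -> torch.Tensor:
    """Hash integer voxel coordinates (N, 3) into indices of a feature table."""
    h = (coords * PRIMES.to(coords.device)).sum(dim=-1)
    return torch.remainder(h, table_size)

def rehash_multiscale(coords, tables, resolutions, base_res=512):
    """Look up features for each voxel at every resolution level and concatenate."""
    feats = []
    for table, res in zip(tables, resolutions):
        scaled = torch.div(coords * res, base_res, rounding_mode="floor")
        idx = hash_coords(scaled, table.num_embeddings)
        feats.append(table(idx))
    return torch.cat(feats, dim=-1)  # (N, num_levels * feat_dim)

# Toy usage: 1000 occupied voxels in a 512^3 grid, three resolution levels.
coords = torch.randint(0, 512, (1000, 3))
resolutions = [64, 128, 256]
tables = torch.nn.ModuleList([torch.nn.Embedding(2**16, 8) for _ in resolutions])
features = rehash_multiscale(coords, tables, resolutions)
print(features.shape)  # torch.Size([1000, 24])
```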
To overcome the difficulty of collecting real-world city-scale 3D models with high-quality
geometry and appearance, we additionally introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height
maps.
Validated on this dataset, our framework generates
detailed 3D structures from a single satellite image, achieving higher fidelity than existing city generation
models.