Sat2City v2

Native 3D City Asset Generation from a Single Satellite Image

Tongyan Hua¹† Dongli Wu¹† Jinjing Zhu² Yinrui Ren¹ Zhongcheng Hong¹ Ying-Cong Chen¹,⁴ Hui Xiong¹,⁴ Wufan Zhao¹✉

Pre-print

¹HKUST(GZ) ²The Chinese University of Hong Kong ⁴HKUST

†Equal contribution

Abstract

Generating explicit, high-quality 3D city assets from a single satellite image is important for digital twins, urban simulation, and geospatial intelligence. Sat2City v2 advances the original Sat2City from a synthetic height-map-conditioned generator to a real-world, appearance-controllable satellite-to-3D asset framework. The key idea is to adapt a pretrained native structured-latent 3D foundation model to weakly aligned satellite-image and textured-mesh pairs collected from matched geographic bounding boxes.
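As a rough illustration of what pairing by matched geographic bounding boxes could look like, the sketch below pairs each satellite tile with the mesh tile whose bounding box overlaps it most. The abstract does not describe the actual matching procedure, so the `BBox` type, the intersection-over-union criterion, the `min_iou` threshold, and the tile identifiers are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical geographic bounding box in degrees (not from the paper)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def area(self) -> float:
        # Clamp at zero so an empty (non-overlapping) box has zero area.
        width = max(0.0, self.max_lon - self.min_lon)
        height = max(0.0, self.max_lat - self.min_lat)
        return width * height


def iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two boxes, in [0, 1]."""
    inter = BBox(
        max(a.min_lon, b.min_lon),
        max(a.min_lat, b.min_lat),
        min(a.max_lon, b.max_lon),
        min(a.max_lat, b.max_lat),
    ).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def pair_by_bbox(sat_tiles: dict, mesh_tiles: dict, min_iou: float = 0.9) -> list:
    """Pair each satellite tile with its best-overlapping mesh tile.

    The threshold is an assumed value; a pair below it is dropped.
    """
    pairs = []
    for sat_id, sat_box in sat_tiles.items():
        best = max(mesh_tiles.items(), key=lambda kv: iou(sat_box, kv[1]), default=None)
        if best is not None and iou(sat_box, best[1]) >= min_iou:
            pairs.append((sat_id, best[0]))
    return pairs


if __name__ == "__main__":
    sat = {"s0": BBox(0, 0, 1, 1), "s1": BBox(5, 5, 6, 6)}
    mesh = {"m0": BBox(0, 0, 1, 1.05), "m1": BBox(10, 10, 11, 11)}
    print(pair_by_bbox(sat, mesh))
```

Pairs formed this way share a geographic footprint but not pixel-accurate registration, which is one sense in which such pairs can be called weakly aligned.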

We construct a dataset with 14,651 training and 1,590 held-out test satellite-mesh pairs across 24 regions in 9 cities. Sat2City v2 encodes each city mesh into a pretrained native 3D latent space, fine-tunes an image-conditioned shape-flow model, and uses the decoded shape as an anchor for satellite-conditioned texturing. The resulting pipeline produces reusable textured mesh assets instead of rendering-oriented 3D proxies.
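To put the dataset figures above in proportion, the short calculation below derives the total number of pairs, the held-out fraction, and the mean number of pairs per region. Only the four input counts come from the text; the derived averages are plain arithmetic and do not imply that regions or cities are equally sized.

```python
# Counts stated in the abstract.
train_pairs = 14_651
test_pairs = 1_590
regions = 24
cities = 9

# Derived quantities (simple arithmetic, not reported by the authors).
total_pairs = train_pairs + test_pairs
test_fraction = test_pairs / total_pairs
pairs_per_region = total_pairs / regions
regions_per_city = regions / cities

print(f"total pairs:        {total_pairs}")
print(f"held-out fraction:  {test_fraction:.1%}")
print(f"mean pairs/region:  {pairs_per_region:.0f}")
print(f"mean regions/city:  {regions_per_city:.1f}")
```

Roughly one pair in ten is held out for testing.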

What Is New in v2?

Data

Synthetic → Real

Sat2City v1 supports synthetic data only; v2 learns from real satellite imagery and real-world city meshes.

Appearance

Random → Controllable

In v1, appearance is generated randomly; v2 conditions appearance on the input satellite image.

Resolution

128 → 512

v2 raises the resolution from 128 in v1 to 512, recovering finer city geometry and texture details.

@article{hua2026sat2cityv2,
  title={Sat2City v2: Native 3D City Asset Generation from a Single Satellite Image},
  author={Hua, Tongyan and Wu, Dongli and Zhu, Jinjing and Ren, Yinrui and Hong, Zhongcheng and Chen, Ying-Cong and Xiong, Hui and Zhao, Wufan},
  journal={Pre-print},
  year={2026}
}