This study demonstrates that a transformer-based neural operator (TNO) can perform zero-shot super-resolution of two-dimensional temperature fields near the ground in urban areas. During training, super-resolution is performed from a horizontal resolution of 100 m to 20 m, while during testing, it is performed from 100 m to a finer resolution of 5 m. This setting is referred to as zero-shot, since no data with the target 5 m resolution are included in the training dataset. The 20 m and 5 m resolution data were independently obtained by dynamically downscaling the 100 m data using a physics-based micrometeorology model that resolves buildings. Compared to a convolutional neural network, the TNO more accurately reproduces temperature distributions at 5 m resolution and reduces test errors by approximately 33%. Furthermore, the TNO successfully performs zero-shot super-resolution even when trained with unstructured data, in which grid points are randomly arranged. These results suggest that the TNO recognizes building shapes independently of grid point locations and adaptively infers the temperature fields induced by buildings.
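The zero-shot setting described above depends on the model being evaluable at arbitrary output coordinates, whether on a 20 m grid, a 5 m grid, or at randomly placed points. The following is a minimal illustrative sketch of that idea, not the TNO of this study: it uses an untrained, coordinate-based cross-attention in NumPy, and the function names (`grid_coords`, `fourier_features`, `cross_attention`), the 500 m domain size, and the synthetic temperatures are all assumptions introduced for illustration.

```python
import numpy as np


def grid_coords(extent_m, spacing_m):
    """Cell-centre (x, y) coordinates of a square grid, shape (n * n, 2)."""
    n = int(round(extent_m / spacing_m))
    centres = (np.arange(n) + 0.5) * spacing_m
    xx, yy = np.meshgrid(centres, centres, indexing="ij")
    return np.stack([xx.ravel(), yy.ravel()], axis=1)


def fourier_features(xy, extent_m, n_freq=4):
    """Encode coordinates as sines and cosines, shape (N, 4 * n_freq)."""
    scaled = xy / extent_m                      # map coordinates to [0, 1]
    freqs = (2.0 ** np.arange(n_freq)) * np.pi  # hypothetical frequency choice
    angles = scaled[:, :, None] * freqs         # shape (N, 2, n_freq)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(len(xy), -1)


def cross_attention(query_xy, token_xy, token_values, extent_m, temperature=0.1):
    """Evaluate a field at query points by attending over low-resolution tokens.

    No learned weights are used (assumption: illustration only), so the
    result is just a smooth, position-based weighted average of the inputs.
    """
    q = fourier_features(query_xy, extent_m)
    k = fourier_features(token_xy, extent_m)
    scores = q @ k.T / (temperature * q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ token_values


extent = 500.0                                  # hypothetical domain size in metres
rng = np.random.default_rng(0)

lr_xy = grid_coords(extent, 100.0)              # 100 m input grid
lr_temp = 300.0 + rng.normal(size=len(lr_xy))   # synthetic temperatures in kelvin

xy_20 = grid_coords(extent, 20.0)               # resolution used in training
xy_5 = grid_coords(extent, 5.0)                 # resolution used only in testing
xy_rand = rng.uniform(0.0, extent, size=(1000, 2))  # unstructured query points

t_20 = cross_attention(xy_20, lr_xy, lr_temp, extent)
t_5 = cross_attention(xy_5, lr_xy, lr_temp, extent)
t_rand = cross_attention(xy_rand, lr_xy, lr_temp, extent)
print(t_20.shape, t_5.shape, t_rand.shape)
```

The same function handles all three query sets without modification because the output resolution enters only through the query coordinates. A convolutional network tied to a fixed pixel grid lacks this property by construction.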