Semantic segmentation of urban areas is crucial for many applications. Self-supervised networks require few or no labels for training, which makes them highly appealing. One such network is STEGO [1], which builds upon DINO [2] and operates without any labeled data, yet effectively segments buildings, vegetation, and roads in the ISPRS Potsdam dataset [3]. The resulting segmentations are refined using a Conditional Random Field (CRF). In remote sensing, additional channels such as the Normalized Digital Surface Model (NDSM) aid the segmentation task, because pixels of the same class often exhibit similar elevation characteristics, especially when they are adjacent. Since the transformer-based DINO network is built for RGB data, we overcome this limitation by extending the CRF with NDSM information: we introduce a second pairwise potential that encourages neighboring pixels with similar elevation to take the same label. For evaluation in both the linear and the cluster probe, we employ Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) to assess the segmentation against the six classes of the Potsdam dataset, in addition to the standard IoU and accuracy metrics. Enhancing the CRF with elevation information raises the mIoU by 0.83% over the RGB-only baseline in the cluster probe, a considerable improvement.
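To illustrate the idea of an elevation-aware pairwise potential, the following is a minimal sketch and not the implementation used in this work. It assumes a Gaussian kernel over spatial distance and NDSM height difference, modelled on the appearance kernel commonly used in dense CRFs, combined with a Potts-style label compatibility. The function names and the parameters `w`, `theta_pos`, and `theta_h` are hypothetical and chosen only for this example.

```python
import math


def elevation_pairwise_weight(pos_i, pos_j, h_i, h_j,
                              w=1.0, theta_pos=3.0, theta_h=1.0):
    """Affinity between two pixels from spatial distance and height difference.

    Assumed form (not taken from the source):
        w * exp(-|p_i - p_j|^2 / (2 theta_pos^2) - (h_i - h_j)^2 / (2 theta_h^2))
    pos_* are (row, col) pixel coordinates, h_* are NDSM heights.
    """
    d2 = (pos_i[0] - pos_j[0]) ** 2 + (pos_i[1] - pos_j[1]) ** 2
    dh2 = (h_i - h_j) ** 2
    return w * math.exp(-d2 / (2.0 * theta_pos ** 2) - dh2 / (2.0 * theta_h ** 2))


def pairwise_energy(label_i, label_j, weight):
    """Potts-style compatibility: pay the weight only if the labels differ."""
    return weight if label_i != label_j else 0.0


# Two adjacent pixels at the same height (e.g. both on a flat roof).
w_flat = elevation_pairwise_weight((0, 0), (0, 1), h_i=10.0, h_j=10.0)

# Two adjacent pixels across a 5 m height step (e.g. roof edge to street).
w_step = elevation_pairwise_weight((0, 0), (0, 1), h_i=10.0, h_j=5.0)

# Assigning different labels is costly on the flat roof, cheap across the step.
print(pairwise_energy(0, 1, w_flat), pairwise_energy(0, 1, w_step))
```

Under these assumptions, the penalty for assigning different labels is high when neighboring pixels share a similar elevation and nearly vanishes across a height discontinuity, which is the behavior the second pairwise potential is meant to encourage.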