Depth estimation and semantic segmentation are crucial for visual perception and scene understanding. Multi-task learning, which captures shared features across multiple tasks within a scene, is often applied to depth estimation and semantic segmentation to jointly improve accuracy. In this paper, a deformable attention-guided network for multi-task learning is proposed to enhance the accuracy of both depth estimation and semantic segmentation. The network architecture consists of a shared encoder, initial prediction modules, deformable attention modules, and decoders. RGB images are first fed into the shared encoder to extract generic representations for the different tasks. These shared feature maps are then decoupled into depth, semantic, edge, and surface normal features in the initial prediction module. At each stage, deformable attention is applied to the depth and semantic features under the guidance of fused features in the deformable attention module. The decoders upsample each attention-enhanced feature map and output the final predictions. The proposed model achieves an mIoU of 44.25% for semantic segmentation and an RMSE of 0.5183 for depth estimation, outperforming the single-task baseline, the multi-task baseline, and a state-of-the-art multi-task learning model.
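For a concrete picture of the data flow described above, the following is a minimal PyTorch sketch of the pipeline: a shared encoder, an initial prediction module that decouples the shared features into four task-specific streams, a fusion-guided attention step, and per-task decoders. The backbone, channel widths, class count, and the sigmoid-gated convolution standing in for the paper's (unspecified) deformable attention module are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class InitialPred(nn.Module):
    """Decouples shared features into task-specific features
    (depth, semantic, edge, surface normal)."""
    def __init__(self, channels, tasks=("depth", "semantic", "edge", "normal")):
        super().__init__()
        self.heads = nn.ModuleDict({
            t: nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                             nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for t in tasks
        })

    def forward(self, shared):
        return {t: head(shared) for t, head in self.heads.items()}

class AttentionGuided(nn.Module):
    """Stand-in for the deformable attention module: refines the depth and
    semantic features under the guidance of a fused feature map. A plain
    sigmoid-gated convolution replaces deformable attention here, purely
    for illustration."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(4 * channels, channels, 1)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.Sigmoid())

    def forward(self, feats):
        # Fuse all four task streams, then gate the two target tasks.
        fusion = self.fuse(torch.cat(
            [feats[t] for t in ("depth", "semantic", "edge", "normal")], dim=1))
        attn = self.gate(fusion)
        return {"depth": feats["depth"] * attn,
                "semantic": feats["semantic"] * attn}

class MultiTaskNet(nn.Module):
    def __init__(self, channels=64, num_classes=40):  # class count assumed
        super().__init__()
        # Placeholder backbone; a real model would use a deeper encoder.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1),
            nn.ReLU(inplace=True))
        self.initial_pred = InitialPred(channels)
        self.attention = AttentionGuided(channels)
        self.depth_decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, 1, 3, padding=1))
        self.seg_decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, num_classes, 3, padding=1))

    def forward(self, rgb):
        shared = self.encoder(rgb)              # shared representation
        task_feats = self.initial_pred(shared)  # decouple into four tasks
        refined = self.attention(task_feats)    # fusion-guided refinement
        return (self.depth_decoder(refined["depth"]),
                self.seg_decoder(refined["semantic"]))

depth, seg = MultiTaskNet()(torch.randn(1, 3, 480, 640))
print(depth.shape, seg.shape)  # (1, 1, 480, 640), (1, 40, 480, 640)
```

The sketch keeps the paper's key design choice visible: the edge and surface normal streams are auxiliary and contribute only through the fused guidance signal, while only the depth and semantic streams are decoded into final predictions.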