Object detection is a fundamental yet long-standing challenge in computer vision. However, current object detection models pay little attention to salient features when fusing the lateral connections and top-down information flows in feature pyramid networks (FPNs). To address this, we propose an object detection method based on an enhanced bi-directional attention feature pyramid network, which aims to strengthen the feature representation capability of the lateral connections and top-down links in the FPN. The method adopts a triplet module to attend to salient features of the original multi-scale information in the spatial and channel dimensions, establishing an enhanced triplet attention. In addition, it introduces an improved top-down attention that fuses contextual information using the correlation of features between adjacent scales. Furthermore, adaptive spatial feature fusion and self-attention are introduced to expand the receptive field and improve detection performance at the deep levels. Extensive experiments on the PASCAL VOC, MS COCO, KITTI, and CrowdHuman datasets demonstrate that our method achieves performance gains of 1.8%, 0.8%, 0.5%, and 0.2%, respectively. These results indicate that our method is effective and competitive with advanced detectors.
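For illustration, the sketch below shows one way a triplet-attention-style block can weight salient features along the spatial and channel dimensions before FPN fusion. It is a minimal PyTorch sketch of the generic triplet-attention idea, not the paper's enhanced variant; the class names (ZPool, AttentionGate, TripletAttention) are assumptions introduced here.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate channel-wise max- and mean-pooled maps (2 output channels)."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """ZPool -> 7x7 conv -> BN -> sigmoid, producing a single-channel gate."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        return torch.sigmoid(self.bn(self.conv(self.pool(x))))

class TripletAttention(nn.Module):
    """Three branches capture H-W, C-W, and C-H interactions via axis permutation."""
    def __init__(self):
        super().__init__()
        self.gate_hw = AttentionGate()
        self.gate_cw = AttentionGate()
        self.gate_ch = AttentionGate()

    def forward(self, x):                                   # x: (B, C, H, W)
        out_hw = x * self.gate_hw(x)                        # plain spatial attention
        x_cw = x.permute(0, 2, 1, 3).contiguous()           # (B, H, C, W)
        out_cw = (x_cw * self.gate_cw(x_cw)).permute(0, 2, 1, 3)
        x_ch = x.permute(0, 3, 2, 1).contiguous()           # (B, W, H, C)
        out_ch = (x_ch * self.gate_ch(x_ch)).permute(0, 3, 2, 1)
        return (out_hw + out_cw + out_ch) / 3.0

# usage on one FPN lateral feature map
feat = torch.randn(1, 256, 64, 64)
refined = TripletAttention()(feat)        # same shape: (1, 256, 64, 64)
```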
Due to the small size, high density, and background noise associated with strip surface defects, current object detection models commonly face performance limitations. To address this issue, we propose a spatial-to-depth feature-enhanced detection method called STD-Detector. The method consists of two module types, STD-Conv-A and STD-Conv-B. First, the STD-Conv-A module is used in the backbone feature extraction network to expand the receptive field and enable the model to learn a wider range of background information. Then, the STD-Conv-B module is used in the feature fusion network to improve the expression of the output features. In addition, we incorporate the convolutional block attention module to mitigate background interference and further enhance the model's performance. Finally, experimental results on the NEU-DET dataset show that our method achieves a mean average precision of 82.9%, a 3.9% improvement over the baseline. Compared with state-of-the-art models, our method is more competitive in detecting strip surface defects. Moreover, experimental results on a road surface defect dataset show that our method is robust.
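As a rough illustration, the following PyTorch sketch shows the general space-to-depth idea that the name STD-Conv suggests: the feature map is downsampled by rearranging spatial blocks into channels (so no pixels are discarded, unlike strided convolution or pooling) and then mixed with a stride-1 convolution. This is an assumption about the module's structure; the paper's STD-Conv-A/STD-Conv-B variants and their exact hyperparameters are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class STDConv(nn.Module):
    """Space-to-depth downsampling followed by a non-strided convolution."""
    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels * scale * scale, out_channels,
                      kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        # (B, C, H, W) -> (B, C*scale^2, H/scale, W/scale): each scale x scale
        # spatial block is stacked into the channel dimension, keeping all pixels
        x = F.pixel_unshuffle(x, self.scale)
        return self.conv(x)

# usage: halve the resolution without discarding pixels
x = torch.randn(1, 64, 80, 80)
y = STDConv(64, 128)(x)          # (1, 128, 40, 40)
```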
Real-time semantic segmentation remains challenging, and the fusion of features from different branches is crucial to further improvement. The two-branch structure has shown promising results in real-time semantic segmentation. However, upsampling feature maps from the semantic branch to match the detail branch leads to a loss of object feature information and compromises segmentation accuracy. To address these issues, we propose a deep bilateral fusion and bilateral embedded network (BFBE-Net) based on the encoder–decoder structure for real-time semantic segmentation. The BFBE-Net adopts a two-branch design in the encoder, with a top-down fusion module and a bottom-up fusion module designed to integrate multi-scale context information in the channel dimension, assigning different weights to detailed and semantic information to enhance their characteristics. In the decoder, a bilateral embedded attention module, guided by spatial and channel attention, integrates semantic and spatial features and gradually upsamples feature maps to reduce the loss of feature information. In addition, an enhanced aggregation pyramid pooling module is designed to efficiently extract contextual information by combining depth-wise asymmetric convolution. The proposed algorithm is evaluated on two benchmark datasets, Cityscapes and CamVid, achieving 78.5% mean intersection over union (mIoU) at 82 frames per second (FPS) on the Cityscapes test set and 79.2% mIoU at 131 FPS on the CamVid test set. The proposed BFBE-Net not only improves segmentation accuracy but also maintains real-time performance.
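The sketch below illustrates, under assumptions, how a fusion module of this kind can upsample the semantic branch and re-weight the concatenated branches with channel attention before merging. It is a generic bilateral-fusion example in PyTorch, not the BFBE-Net modules themselves; the name BilateralFusion is introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilateralFusion(nn.Module):
    """Fuse a high-resolution detail feature with an upsampled semantic feature,
    re-weighting the concatenated branches with channel attention."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, detail, semantic):
        # upsample the low-resolution semantic branch to the detail resolution
        semantic = F.interpolate(semantic, size=detail.shape[-2:],
                                 mode='bilinear', align_corners=False)
        fused = torch.cat([detail, semantic], dim=1)
        # channel weights decide how much each branch contributes
        fused = fused * self.attn(fused)
        return self.proj(fused)

# usage
detail = torch.randn(1, 128, 96, 96)     # high-resolution detail branch
semantic = torch.randn(1, 128, 12, 12)   # low-resolution semantic branch
out = BilateralFusion(128)(detail, semantic)   # (1, 128, 96, 96)
```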
Deeplab-series semantic segmentation algorithms extract target semantic features using the deep layers of a convolutional neural network, so the target features lack the detailed information, such as edges and shapes, extracted by the shallow layers. Deeplabv3plus uses atrous convolution to obtain feature maps, which loses some image information. Both issues limit the improvement of segmentation performance. In response, we propose an image semantic segmentation algorithm based on a multi-expert system that builds multiple expert models on the Deeplabv3plus network architecture. For a target image, each expert model makes an independent judgment, and the segmentation result is obtained through the ensemble learning of these expert models. Expert model 1 employs the proposed attention-based atrous spatial pyramid pooling (C-ASPP) module to capture richer global semantic information via a parallel attention mechanism and the ASPP module. Expert model 2 designs a feature fusion-based decoder that uses a feature fusion approach to obtain detailed information. Expert model 3 introduces a loss function into the Deeplabv3plus network to supervise the loss of detailed information. The final segmentation result is generated by adjudicating the results derived from the different expert models, which improves segmentation performance by compensating for the loss of detailed information and enhancing the semantic features. Evaluated on the commonly used semantic segmentation datasets PASCAL VOC 2012 and CamVid, the algorithm's mIoU reached 82.42% and 69.18%, respectively, 2.46% and 1.82% higher than Deeplabv3plus, demonstrating the algorithm's better segmentation performance.
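As one plausible realization of the adjudication step, the sketch below averages the per-pixel class logits of several expert models and takes the arg-max. This simple logit-averaging ensemble is an assumption standing in for the paper's adjudication scheme, and the helper name ensemble_segment is hypothetical.

```python
import torch

@torch.no_grad()
def ensemble_segment(experts, image):
    """Average per-pixel class logits from several expert segmentation models.

    `experts`: list of models mapping (B, 3, H, W) images to (B, K, H, W) logits.
    Returns an integer label map of shape (B, H, W).
    """
    summed = None
    for model in experts:
        model.eval()
        logits = model(image)
        summed = logits if summed is None else summed + logits
    return (summed / len(experts)).argmax(dim=1)

# usage: final = ensemble_segment([expert1, expert2, expert3], image_batch)
```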
To fully exploit the complementary advantages of different visual features and to improve the robustness of multi-feature fusion, we propose a robust correlation filter tracker with adaptive multi-complementary feature fusion based on game theory. By combining complementary features selected from handcrafted and convolutional features, our method constructs two robust combined features within the tracking framework of discriminative correlation filters (DCFs). In addition, using game theory, the two combined features are regarded as the two sides of a game that reach the best balance through continuous gaming throughout the tracking process, thereby yielding a more robust fused feature. Experimental results on the OTB2015 benchmark dataset demonstrate that our tracker improves the robustness of object tracking in complex scenarios, such as occlusion and deformation, and performs favorably against eight state-of-the-art methods.
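For illustration only, the NumPy sketch below fuses the response maps of two DCFs (one per combined feature) by weighting each with its peak-to-sidelobe ratio. This confidence-proportional weighting is a simple stand-in and does not reproduce the paper's game-theoretic balancing; the function names psr and fuse_responses are assumptions.

```python
import numpy as np

def psr(response):
    """Peak-to-sidelobe ratio: a common confidence measure for DCF response maps."""
    peak = response.max()
    py, px = np.unravel_index(response.argmax(), response.shape)
    sidelobe_mask = np.ones_like(response, dtype=bool)
    sidelobe_mask[max(0, py - 5):py + 6, max(0, px - 5):px + 6] = False  # drop peak window
    sidelobe = response[sidelobe_mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-12)

def fuse_responses(resp_a, resp_b):
    """Weight each response map by its relative confidence and sum the maps."""
    c_a, c_b = psr(resp_a), psr(resp_b)
    w_a = c_a / (c_a + c_b + 1e-12)
    fused = w_a * resp_a + (1.0 - w_a) * resp_b
    return fused, np.unravel_index(fused.argmax(), fused.shape)

# usage: fused, target_pos = fuse_responses(resp_handcrafted, resp_cnn)
```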