This PDF file contains the front matter associated with SPIE Proceedings Volume 13521, including the Title Page, Copyright information, Table of Contents, and Conference Committee information.
Despite recent advances in Video Object Segmentation (VOS), state-of-the-art models still face challenges with occlusion, fast motion, and long-term tracking. These difficulties often result in a noticeable degradation of segmentation accuracy as the video progresses, leading to poor performance over extended periods. To tackle these persistent issues, we propose an innovative approach that enhances video segmentation by introducing predictive segmentation heads. Building upon the Cutie model, our method enables the model to predict Intersection over Union (IoU) results without ground truth, thereby enhancing video segmentation performance in challenging scenarios.
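To make the idea concrete, here is a minimal sketch of what a ground-truth-free IoU prediction head can look like: a small convolutional regressor over the mask features that outputs a quality score in [0, 1]. The layer sizes and the attachment point are illustrative assumptions, not the Cutie-based architecture itself.

```python
import torch
import torch.nn as nn

class IoUPredictionHead(nn.Module):
    """Hypothetical sketch: regress the IoU of a predicted mask from its
    features, so segmentation quality can be estimated at inference time
    without ground truth."""
    def __init__(self, in_channels: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1), nn.Sigmoid(),  # IoU lies in [0, 1]
        )

    def forward(self, mask_features: torch.Tensor) -> torch.Tensor:
        # mask_features: (B, in_channels, H, W) -> predicted IoU per sample, (B,)
        return self.net(mask_features).squeeze(-1)
```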
Document layout analysis can help people better understand and apply the information in a document. However, the diversity of document elements presents a significant challenge for layout analysis. In this study, we designed the Dilation-wise Residual (DWR) module as the backbone structure of the network, using the YOLO model as a baseline. The combination of varied dilation rates in this module significantly enhances image feature extraction. Moreover, we incorporated a weighted feature fusion method to improve the fusion efficiency of layout elements, thereby facilitating more exhaustive feature extraction. To reduce model complexity and make the model more lightweight, the Slim-neck paradigm is introduced to cut the number of network parameters and the amount of computation. The proposed model achieves remarkable results on two public datasets, PubLayNet and DocLayNet. The mean Average Precision (mAP) on the PubLayNet dataset reached 95.7%, 2.1% higher than the baseline model, and the number of model parameters was reduced by 19% without compromising accuracy.
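A rough sketch of the dilation-wise residual idea follows: parallel 3x3 convolutions with different dilation rates widen the receptive field, and their fused output is added back residually. The channel widths and rates are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DWRBlock(nn.Module):
    """Sketch of a dilation-wise residual block: parallel 3x3 branches with
    different dilation (atrous) rates enlarge the receptive field, and the
    concatenated branches are fused back to the input width for a residual add."""
    def __init__(self, channels: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates
        )
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return self.act(x + self.fuse(out))  # residual connection
```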
In situations where vehicles are traveling against the light during midday, objects that need to be identified can become obscured by the glare, making it challenging for autonomous driving systems to accurately recognize them. Research on the processing of overexposed images in such scenarios remains relatively limited. Using the Road Vehicle Images Dataset, we conducted experiments to assess the performance of SSR, MSR, and MSRCR in handling overexposed images, with MSRCR yielding the best results. Building on MSRCR, we further enhanced the images by applying gamma correction. First, we simulated overexposure, then applied MSRCR to recover the image details, followed by gamma correction to adjust the brightness of the enhanced images. The results were evaluated using SSIM and PSNR metrics. The average SSIM value exceeded 70%, and the average PSNR value was above 10. Compared to images processed with MSRCR alone, the addition of gamma correction resulted in a 5% improvement in SSIM and an increase of approximately 3 decibels in PSNR.
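The gamma-correction step applied after MSRCR is straightforward to sketch; with a gamma above 1 it darkens overexposed regions. The exponent value here is only a placeholder.

```python
import numpy as np

def gamma_correct(img: np.ndarray, gamma: float = 2.2) -> np.ndarray:
    """Apply gamma correction to an 8-bit image. gamma > 1 darkens
    overexposed regions; gamma < 1 brightens dark ones. A minimal sketch of
    the brightness-adjustment step applied after MSRCR."""
    normalized = img.astype(np.float32) / 255.0
    corrected = np.power(normalized, gamma)
    return (corrected * 255.0).clip(0, 255).astype(np.uint8)
```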
Compared to natural images, remote sensing images suffer from more severe detail loss and pose greater challenges for image reconstruction. We propose a super-resolution network for remote sensing images based on hybrid dilated convolution and adaptive pruning. The network expands the receptive field through hybrid dilated convolution to extract more detailed image features. Bilinear interpolation is employed to upsample the extracted features, while residual connections are introduced to reduce information loss in the convolutional layers. A multi-scale attention fusion block enhances both local and global information, allowing improved recovery of the details and overall structure of remote sensing images. The increased computational complexity and parameter count of the fused network make deployment on mobile and edge devices challenging. Therefore, a dual-channel adaptive pruning module is employed to prune redundant channel features, which makes the model more lightweight and practical while preserving image details. The experimental results indicate that the proposed method effectively reconstructs texture details in remote sensing images while maintaining a balanced trade-off between computational efficiency and super-resolution reconstruction quality.
The versatile application of deep learning has made it prone to vulnerabilities, particularly adversarial attacks, which have become increasingly prevalent as its scope expands. Attack methods based on adversarial queries pose significant challenges to image retrieval models, disrupting their functionality. While numerous defense methods have been proposed for adversarial query attacks in image classification, approaches to detecting or defending against such attacks in image retrieval remain limited due to the inherent challenges of adapting these methods. Based on these observations, we introduce the Augmentation Detection based on Top-K-List (ADTL) method, the first effective approach for detecting adversarial attacks in image retrieval. ADTL applies image augmentation to a query and determines its adversarial nature by calculating the divergence in the Top-K list before and after augmentation. The method employs suitable image augmentation techniques and K-Reciprocal clustering to pre-classify gallery images, easing the divergence calculation. Tested on the Oxford5k and Paris6k datasets, it shows remarkable effectiveness, with a detection accuracy of 87.3%.
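A minimal sketch of the detection signal: compare the Top-K retrieval lists before and after augmenting the query, and flag large shifts as adversarial. The Jaccard-based divergence and the threshold are assumptions standing in for the paper's exact formulation.

```python
def topk_divergence(topk_before: list, topk_after: list) -> float:
    """Divergence between two Top-K retrieval lists: 0 means identical lists,
    1 means disjoint. Adversarial queries tend to shift the list far more
    than benign ones after augmentation."""
    before, after = set(topk_before), set(topk_after)
    jaccard = len(before & after) / len(before | after)
    return 1.0 - jaccard

# Hypothetical usage: flag the query as adversarial above a tuned threshold.
is_adversarial = topk_divergence([1, 2, 3, 4], [9, 8, 3, 4]) > 0.5
```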
With the ongoing advancement of deep learning technologies, traditional image segmentation approaches are gradually becoming inadequate for the requirements of medical imaging. Medical image segmentation models are growing in importance, playing a key role in accurately delineating anatomical structures and lesion regions. This is crucial for accurate diagnosis, treatment planning, and disease monitoring. In this paper, we propose a compound attention image segmentation model, CAU-Net, which aims to improve the segmentation accuracy of skin lesion images by introducing the receptive field block (RFB), the double squeeze-excitation (DSE) module, and our innovatively proposed compound attention (CA) module. The design of CAU-Net focuses on enhancing feature extraction and contextual information capture, improving edge segmentation, and performing strongly on medical images with complex edges and cluttered backgrounds. The model reached an mDice of 0.9408 and a recall of 0.9418 on the ISIC 2018 dataset. Future work can further optimize these modules and extend them to more medical imaging tasks.
In this paper, we propose an improved B-H-Deformable-DETR (Bayesian H-Deformable-DETR) model to address the insufficient accuracy and generalization ability of Few-Shot Object Detection. In our study, we adopt a Bayesian MLP (Multi-Layer Perceptron based on Bayesian linear layers) as the prediction layer for bounding box regression. By introducing prior distributions and uncertainty estimation, we transform a neural network with deterministic parameters into a probabilistic neural network with stochastic properties. This probabilistic network can handle the variability and sparsity of data more effectively, with the expectation of improving performance under uncertainty and enhancing generalization. As a result, using ResNet50 and Swin-Transformer as backbones and training on the first 5000 images of the COCO dataset, our model improves average precision by +5.4 AP and +4.6 AP, respectively, compared to the original H-Deformable-DETR model. These results indicate that our model has significantly improved prediction accuracy and enhanced generalization ability on small-sample datasets.
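A minimal sketch of a Bayesian linear layer of the kind described, using the reparameterization trick; the initialization and variational parameterization are conventional assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class BayesianLinear(nn.Module):
    """Sketch of a Bayesian linear layer: weights are a learned Gaussian
    (mean, rho), and a fresh sample is drawn per forward pass via the
    reparameterization trick, turning point estimates into a distribution."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        sigma = torch.log1p(torch.exp(self.w_rho))     # softplus keeps sigma > 0
        weight = self.w_mu + sigma * torch.randn_like(sigma)
        return nn.functional.linear(x, weight, self.bias)
```

Averaging several stochastic forward passes then gives both a bounding-box estimate and a spread that reflects the model's uncertainty.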
Recently, XR virtual shooting has developed rapidly, with LED displays playing a pivotal role in promoting the prosperity of the film and television industry. However, the additional lighting needed for fill light often produces specular or diffuse reflections from the LED displays that are prominently visible in the video data, degrading overall visual quality. Meanwhile, given the strict time constraints in most XR scenes, lightweight approaches are indispensable. To benchmark single-image reflection removal for LED images, we collect and introduce a new dataset named the AOTO Denoising Dataset, comprising 5500 LED images with white reflection areas. Furthermore, we propose a novel lightweight denoising model called MoE-LED-light for efficient LED image reflection removal. Specifically, we innovatively adopt the proposed mixture-of-experts strategy to enhance reflection removal performance without compromising inference efficiency. Experimental results on our AOTO Denoising Dataset and four public datasets demonstrate that our MoE-LED-light model not only effectively eliminates white reflection areas in images but also runs efficiently, operating at over 50 frames per second.
Most existing low-light image enhancement methods assume uniform low illumination and are prone to issues such as overexposure and amplified noise in dark areas when applied to nighttime road images with varied lighting and high noise levels. This paper proposes a zero-reference, two-stage nighttime road image enhancement method. In the first stage, a lightweight visual attention network (LVAN) is developed to generate a dark-aware attention map, effectively avoiding overexposure and underexposure. The second stage applies a zero-reference deep image enhancement network in CIELab color space (ZeroDIE_Lab), which significantly enhances image brightness and effectively suppresses noise and artifacts while maintaining color consistency. In comparison with established low-light image enhancement approaches such as Zero-DCE, the experimental results demonstrate that the proposed method not only enhances the visual clarity and structural details of the image, but also exhibits notable advantages in quantitative evaluation metrics, including image quality assessment and computational complexity.
Multimodal image fusion integrates data from multiple sensors to produce richer and more informative images. One notable instance is the fusion of visible and infrared images, which combines the fine-grained texture of a visible image with the heat emission of an infrared image. How well the initial features are extracted is critical to the quality of the fusion. We designed a SAFMB module that allows better extraction of image features at the initial stage, i.e., separating the visible and infrared images into high- and low-frequency components. In the subsequent encoder, the FPU-NET structure is used to obtain multiple feature layers for better fusion during up-sampling and down-sampling. To obtain the fused infrared-visible image, the picture is rebuilt from the new low- and high-frequency portions. The model was evaluated on the RoadScene dataset, improving standard metrics such as SD, EN, and SSIM.
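The high/low-frequency separation underlying SAFMB can be sketched with a simple low-pass filter; the box blur here is an assumption, since the module's actual filters are learned.

```python
import torch
import torch.nn.functional as F

def split_frequencies(img: torch.Tensor, kernel_size: int = 5):
    """Sketch of frequency separation: a blur acts as a low-pass filter, and
    subtracting it from the input leaves the high-frequency detail.
    img: (B, C, H, W) tensor."""
    pad = kernel_size // 2
    low = F.avg_pool2d(img, kernel_size, stride=1, padding=pad)
    high = img - low
    return low, high
```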
In the field of underwater sonar image classification, challenges such as limited data and severe class imbalance hinder conventional convolutional neural networks from achieving optimal accuracy. This study introduces long-tail recognition algorithms, enhanced resampling methods, and transfer learning strategies with novel balancing techniques to improve classification performance. When applied to the Seabed Objects-KLSG dataset, the improved resampling method minimizes information loss in major classes while simultaneously enhancing accuracy for tail classes, compared to traditional approaches. Furthermore, the use of a pre-trained ResNet network in conjunction with the new rebalancing method yields higher classification accuracy, particularly for tail classes, as demonstrated through five-fold cross-validation. Finally, to address limitations in feature learning associated with transfer learning, we propose an enhanced Bilateral Branch Network (BBN), which exhibits superior performance in feature extraction and maintains recognition accuracy above 90%. These innovative methods not only bolster the model's generalization ability but also provide a new technical framework for the accurate classification of underwater sonar images.
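A sketch of the rebalancing idea: sampling weights follow an inverse power of class frequency, softening full inverse-frequency resampling so that head classes lose less information. The exponent is a tunable assumption.

```python
import numpy as np
from collections import Counter

def balanced_sample_weights(labels, power: float = 0.5):
    """Weight each sample by an inverse power of its class frequency.
    power=1 is full inverse-frequency resampling; power=0.5 is a softer
    compromise that preserves more head-class information. The returned
    probabilities can feed np.random.choice or a weighted data sampler."""
    freq = Counter(labels)
    weights = np.array([1.0 / freq[y] ** power for y in labels])
    return weights / weights.sum()
```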
Recently, the Masked AutoEncoder (MAE) was proposed and has demonstrated its proficiency as a vision learner through an ingenious encoder-decoder architecture, drastically improving both pre-training efficiency and fine-tuning accuracy. In this study, we compare different masking strategies for MAE self-supervised learning, including the random strategy and the Zig-Zag, Hilbert, Peano, and Spiral scanning strategies. Experimental results show that the Spiral scanning masking strategy performs better than the other methods.
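A sketch of how a spiral scanning order over the patch grid can be generated; masking a prefix of this order would replace random masking. This construction is one plausible reading of the spiral strategy, not necessarily the paper's exact definition.

```python
import numpy as np

def spiral_order(grid: int) -> np.ndarray:
    """Clockwise spiral ordering of a grid x grid patch layout: repeatedly
    take the top row, then rotate the remaining block counterclockwise.
    Masking the first N indices of this order (instead of a random subset)
    gives a spiral masking strategy."""
    idx = np.arange(grid * grid).reshape(grid, grid)
    order = []
    while idx.size:
        order.extend(idx[0])      # top row, left to right
        idx = np.rot90(idx[1:])   # drop it and rotate the rest
    return np.array(order)
```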
The rapid advancement of autonomous driving technologies necessitates efficient and reliable traffic sign recognition systems, especially under the constraints of limited computational resources and real-time decision-making demands. This paper proposes a novel hybrid model, integrating Dilated Convolutional Neural Networks with an Efficient Mamba framework to address these challenges. The proposed system leverages pre-processing techniques, such as noise reduction through Gaussian or median filters, to minimize false detections. A selective state space model further refines the recognition process, while the Efficient Mamba architecture, optimized for low-resource environments, combines lightweight neural networks with CNN-based feature extraction. Our approach is evaluated on the GTSRB dataset, focusing on accuracy, precision, recall, and F1 score, with a particular emphasis on prediction latency to meet real-time operational requirements. Comprehensive experiments, including k-fold cross-validation, demonstrate the superior efficiency and effectiveness of our model compared to baseline methods.
Transmission lines are an indispensable part of the power grid and an important channel for transmitting electric energy across the country. Because overhead transmission lines traverse complex locations and environments, external-damage factors pose great challenges to their normal operation and to equipment reliability. It is therefore of great significance to study intelligent identification methods for external damage to transmission lines to ensure their safe and stable operation. To this end, this paper proposes an external-damage recognition method for transmission lines based on the Mask R-CNN object detection model, trains it on images of transmission lines and various types of external damage, and achieves good experimental results. The method enables intelligent recognition of external damage along transmission channels, thereby reducing the occurrence of external-damage accidents.
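For concreteness, a sketch of the inference step with torchvision's Mask R-CNN; the checkpoint file name, class count, and input here are hypothetical placeholders for the paper's trained model and data.

```python
import torch
import torchvision

# Hypothetical: a Mask R-CNN fine-tuned on an external-damage dataset with
# 4 damage classes plus background. The checkpoint name is a placeholder.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=5)
model.load_state_dict(torch.load('external_damage_maskrcnn.pth'))
model.eval()

with torch.no_grad():
    image = torch.rand(3, 800, 800)  # stand-in for a channel inspection photo
    outputs = model([image])         # per-image dict: boxes, labels, scores, masks
```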
Visible-thermal Video Object Detection (RGBT VOD) aims to detect objects in RGB and thermal video sequences and predict their categories and locations. One of the core aspects of this task is how to efficiently and effectively fuse the two modalities. Due to the limitations of the imaging mechanism, the image quality of the RGB modality declines dramatically under poor lighting conditions, reducing detection performance. Supplementing with thermal data can effectively mitigate these challenges. To fully utilize the complementary information of the two modalities, existing works on feature fusion of RGB and thermal images require an extremely complex modality interaction process, which is time- and resource-intensive. When converting the data of the two modalities to the YCbCr color space, which separates luminance and color, the thermal image carries only luminance information, whereas the RGB image is rich in both luminance and color information. However, due to the imaging mechanism, the luminance information of the RGB modality is easily degraded by bad lighting. To simplify the design of the multimodal interaction module, reduce model complexity, and take full advantage of complementary multimodal information, we consider multimodal fusion in terms of luminance and propose a Light-aware Luminance Adaptive Enhancement Network (LA2ENet). Specifically, we design a Light-aware Luminance Adaptive Enhancement Module (LA2EM) that senses the lighting in the scene. When the luminance information in the RGB image is drastically degraded by bad lighting, the module adaptively introduces the luminance information of the stable thermal infrared image to supplement and improve the quality of the RGB image. After that, we use only the luminance-enhanced RGB image as the model input, making full use of the complementary modal information while reducing model complexity. We conduct extensive experiments on the VT-VOD50 dataset; compared to the baseline, our LA2ENet improves the AP50 metric by 4.66% while remaining almost equal to the baseline in detection speed, demonstrating the effectiveness and efficiency of our proposed method.
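The luminance-level fusion idea can be sketched in YCrCb space, with a simple brightness-driven gate standing in for the learned light-aware module (LA2EM); the linear gate is an assumption.

```python
import numpy as np
import cv2

def light_aware_fuse(rgb: np.ndarray, thermal: np.ndarray) -> np.ndarray:
    """Sketch of luminance-level fusion: estimate how degraded the RGB
    luminance is from the scene's mean brightness and blend in the thermal
    image (a pure-luminance signal) accordingly. rgb: uint8 BGR image;
    thermal: uint8 single-channel image of the same height and width."""
    ycrcb = cv2.cvtColor(rgb, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    y = ycrcb[..., 0]
    alpha = 1.0 - y.mean() / 255.0  # darker scene -> lean on thermal more
    ycrcb[..., 0] = (1 - alpha) * y + alpha * thermal.astype(np.float32)
    return cv2.cvtColor(ycrcb.clip(0, 255).astype(np.uint8), cv2.COLOR_YCrCb2BGR)
```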
With the rapid development of VR technology, the capacity of VR audio-video files has increased significantly, demanding higher transmission performance. Traditional file transmission techniques cannot meet these needs due to the specificity of VR files. Therefore, this paper studies synchronous transmission of VR audio-video files over wireless multipath based on bidirectional transmission. Using DNA data storage encoding/decoding, VR files are uniformly encoded and stored. By dividing VR files into two sub-streams and considering security needs, a synchronous transmission model is established with corresponding sender and receiver algorithms. Experimental results show an average transmission rate of 900/s, a minimum packet loss rate of 1.4%, and an average delay of 0.25 ms, demonstrating the technique's applicability.
With the deepening of deep learning research, power companies have gradually moved away from prevention and control based on manual inspection, adopting deep learning to identify safety hazards in power equipment and to detect equipment defects more accurately and quickly. This paper first introduces the significant drawbacks of traditional methods in the field of power equipment identification, then highlights the advantages of deep learning in this field, leading to the YOLOv7 algorithm used in this study, whose theory and concepts are introduced in detail. Second, this paper constructs a dataset for identifying safety hazards in electrical equipment and uses it to train and test the YOLOv7 algorithm. Finally, analysis of the experimental results shows that the algorithm achieves high accuracy and robustness in identifying safety hazards of electrical equipment. This study provides a practical and feasible technical route and theoretical foundation for detecting hidden safety hazards of electrical equipment using the YOLOv7 algorithm.
With the rapid development of industrial technology, metal materials are widely used in production and daily life. More and more parts and workpieces come off factory assembly lines and are applied in different fields according to their functions. However, as production volume grows, traditional detection methods can no longer keep up with production speed or meet the growing detection requirements. To inspect products quickly and efficiently, new detection methods must be introduced. This study designs a workpiece detection system that integrates imaging, size measurement, and surface defect detection using a variable zoom lens. The system consists of hardware and control components. The hardware part uses a zoom lens instead of an ordinary lens to give the monocular camera a zoom function, so that the device can image focal planes at different focal lengths. The control part covers image acquisition, image processing, size measurement, and defect detection. The system mainly targets workpieces with two or more focal planes, because in practice workpieces with a single focal plane account for only a small fraction of all workpieces, and most workpieces or parts have multiple focal planes. An experimental device was built using the zoom lens, industrial camera, stage, and light source available in the laboratory. In the study, clear images of multiple focal planes were obtained from the industrial camera's captures using an image definition algorithm and an image fusion algorithm, and size measurement and defect detection were carried out on the fused image. The experimental results show that the system achieves clear imaging of multiple focal planes of the workpiece using the zoom lens and, on that basis, realizes size measurement and defect detection. The system has potential application value in many directions such as workpiece cl.
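One common image-definition measure that fits the described pipeline is the variance of the Laplacian; whether the study uses this exact metric is an assumption.

```python
import cv2
import numpy as np

def sharpness(img_gray: np.ndarray) -> float:
    """Variance of the Laplacian: a standard image-definition metric.
    Higher values mean more in-focus detail; comparing it across shots taken
    at different focal lengths selects the sharpest focal plane."""
    return float(cv2.Laplacian(img_gray, cv2.CV_64F).var())
```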
In recent years, the Transformer has achieved remarkable results in scene text spotting. This paper uses the classic encoder-decoder architecture to propose a Text Spotting model based on a Transformer with bidirectional Explicit Point sampling (TSEP). As a sequence, text carries rich semantic information in both its forward and backward features. We model each text instance through bidirectional explicit point sampling. After decoding, the positional and semantic information of the text is integrated into the explicit points, so a basic prediction head can produce the boundary of the text region, the text content, and the corresponding confidence scores. Additionally, we propose a reference point feature enhancement module built from one-dimensional convolutions and an MLP to address the weak spatial inductive bias of non-local self-attention in Transformers. Experimental results across various public datasets indicate that our model outperforms numerous other leading models.
Remote sensing images play a crucial role in large-scale surveillance and data analysis. To address challenges in monitoring and managing external threats to power optical fiber communication networks, this study integrates remote sensing images for network inspection. It proposes an anti-external-damage system for power optical fiber communication networks based on remote sensing image target detection. The study enhances the YOLOv8 algorithm for improved object detection and utilizes a linear threshold model with specific risk factors to analyze and establish a robust anti-external-damage system. Experimental tests and algorithm comparisons validate the system's feasibility and effectiveness.
Tongue diagnosis is one of the critical clinical diagnostic methods of Traditional Chinese Medicine and has a long history. Tongue cracks, as a clinical manifestation in tongue diagnosis, are closely related to diseases associated with the spleen and stomach. There have been several studies on tongue cracks; however, few focus on pixel-level classification, and those end up with low segmentation accuracy. To solve this problem, a network called DSPR-DoubleU-Net is proposed by integrating the Position Attention Module (PAM), the Spatial Pyramid Pooling (SPP) module, the residual structure, the Style-based Recalibration Module (SRM), and the DoubleU-Net, so that the network can extract richer contextual information. Experiments are conducted on our constructed dataset, which contains 351 sets of cracked-tongue and non-cracked-tongue images, to demonstrate the network's effectiveness. The results show that the proposed network can not only classify cracked and non-cracked tongues accurately at the image level but also segment crack edges more precisely than other excellent general-purpose segmentation networks and state-of-the-art crack segmentation networks.
Intelligent recognition of insulator defects in infrared images from UAV inspection has important research value for transmission line inspection. This paper proposes a lightweight deep neural network that combines image semantic segmentation and object detection, addressing the limited computational resources of equipment in practical application scenarios. Specifically, a CA-DS-ASPP module is proposed to improve the DeepLabv3+ model, which is trained with the Adam optimizer, reducing the number of parameters and the computational complexity while improving segmentation accuracy. An improved insulator defect detection model, named YOLOv5s-gv, is then built on the lightweight VanillaNet and GhostConv to reduce the number of parameters and the amount of computation. Experimental results show that the proposed lightweight insulator defect detection algorithm greatly improves detection speed while maintaining detection accuracy.
Instance segmentation is a key task in intelligent robot sorting. In unstructured, complex environments, arbitrary stacking and mutual occlusion of target objects pose significant challenges to accurate segmentation. Existing research shows that segmentation performance can be improved by using depth modalities that contain geometric information. In the underlying CNN, however, the fixed convolutional kernel lacks the ability to capture geometric features in a local region and cannot effectively exploit the detailed information of the depth modality. To address this, this paper proposes an RGB-D instance segmentation network based on depth difference convolution. A depth difference convolution (DDC) mechanism enhances the interaction between modalities by aggregating the intensity and gradient information of normal vectors and depth to perceive subtle geometric information in the local range of the depth data. This mechanism lets structurally similar pixels contribute more to the output, improving the model's adaptability within the local receptive field. Experimental results show that the proposed model greatly improves instance segmentation accuracy compared to models that use only RGB images.
To investigate the electromyographic (EMG) information contained in the initial stage of finger movements, surface EMG signals were first collected using an sEMG acquisition device. The raw data underwent preprocessing and extraction of instantaneous features, which were then input into the models for training. Subsequently, a portion of the feature samples was held out as a validation set for cross-validation. The sEMG signals of finger movements exhibit instantaneous features that can maximally represent the movement patterns. The results indicate that this study achieves accurate and rapid recognition of basic finger movements, accelerating the system's recognition of finger movements.
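Typical instantaneous sEMG features over a short window can be sketched as follows; the exact feature set used in the study is not specified, so these standard ones are assumptions.

```python
import numpy as np

def instantaneous_features(window: np.ndarray):
    """Common instantaneous sEMG features over a short signal window:
    mean absolute value (MAV), root mean square (RMS), and
    waveform length (WL)."""
    mav = np.mean(np.abs(window))
    rms = np.sqrt(np.mean(window ** 2))
    wl = np.sum(np.abs(np.diff(window)))
    return mav, rms, wl
```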
Infrared imaging can overcome visual limitations in low-light and complex environments, but the presence of mixed noise significantly affects its practical performance. Existing denoising methods struggle to handle the mixed noise in infrared images, while traditional visible light denoising techniques cannot be directly applied due to differences in noise types and imaging principles. To address this, we propose a Transformer-based denoising method that leverages its ability to learn long-range dependencies, effectively removing various types of noise. Experimental results demonstrate that this method outperforms traditional techniques and has broad application potential.
Segmenting the vertebrae is an important part of diagnosing spine diseases and planning surgery. The vertebrae's complex anatomy, featuring similarly shaped neighboring bones and vague boundaries, presents substantial challenges for precise semantic segmentation. Traditional methods often depend heavily on local features, making it difficult to capture global context and resulting in less accurate recognition. Consequently, automated and efficient segmentation approaches are crucial to overcoming these issues. This study introduces an innovative Swin-BFB-UNet architecture derived from UNet to enhance vertebra segmentation. We incorporate the Swin Transformer into the encoder to capture both global and local information, improving vertebra segmentation accuracy. The bottleneck fusion block (BFB) fuses multi-scale features and semantic information, enhancing the model's feature representation. Furthermore, we employ both Dice loss and cross-entropy loss to alleviate the adverse effects of foreground-background imbalance. We conducted annotations and experiments on the lateral X-ray images in the BUU-LSPINE dataset. The experimental findings indicate that our model exceeds other leading methods, attaining an mDice of 90.3%, demonstrating its proficiency in precise vertebra segmentation.
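The combined objective is easy to sketch for the binary case; the equal weighting and smoothing constant are conventional assumptions.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, smooth: float = 1.0, ce_weight: float = 0.5):
    """Combined objective: cross-entropy supervises per-pixel classification
    while Dice loss counteracts foreground-background imbalance.
    logits, target: float tensors of the same shape (binary case)."""
    ce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + smooth) / (probs.sum() + target.sum() + smooth)
    return ce_weight * ce + (1 - ce_weight) * dice
```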
In this study, we propose SparseNetYOLOv8, an improved YOLOv8n model for small-object detection in sparse space. The main changes to the original architecture are incorporating GhostConv, CBAM, MobileViTBlock, and DWC within the Backbone, and BiFPN and DySnakeConv in the Neck for improved feature fusion and edge detection, respectively. These enhancements collectively yield a 10.6% improvement in mAP@0.5 and a 7.8% increase in mAP@0.5-0.95, while SAHI in the Head minimizes target omissions. Experimental results demonstrate the robustness of SparseNetYOLOv8 in comparison with other YOLO variants.
Monocular 3D lane detection is a key component of autonomous driving perception systems. Current mainstream methods are mostly based on inverse perspective mapping (IPM) for spatial transformation, but IPM assumes flat ground and static camera parameters, making it difficult to adapt to the complexity of real driving environments. We present a 3D lane detection method named the Modified BEV-LaneDet (M-BEV-LaneDet) network. Firstly, inspired by the slender structure of lanes, a Bird's-Eye-View Feature Aggregation Module (BEV-FAM) is proposed to enhance lane extraction in the BEV features by expanding the convolutional receptive field. Secondly, a lightweight Deep Layer Aggregation Module (DLAM) is proposed as the feature extraction backbone to effectively reduce the number of model parameters and improve multi-scale feature aggregation. Experimental results on the OpenLane dataset demonstrate that our method outperforms previous methods in terms of F-score, scoring 1.1% higher than the BEV-LaneDet [1] network with the number of parameters remaining largely unchanged.
Lane detection and tracking technologies play a crucial role in advancing autonomous driving and advanced driver assistance systems (ADAS). However, existing methods struggle to achieve accurate detection across diverse environments. In this study, we propose an adaptive lane detection and tracking algorithm based on the vanishing point. Initially, the edges of the lane lines are extracted using an ROI and a 2D FIR filter. Next, the vanishing point of the image is detected using Gabor filters, and adaptive inverse perspective transformation is achieved through coordinate conversion. Finally, lane lines are detected with a dynamic sliding window by introducing a spatio-temporal sequence module, while tracking is performed with an adaptive Kalman filter that leverages the lane-width constraint. The method was tested on our dataset, which covers seven different environments and comprises over 4644 frames. The experimental results demonstrate a detection accuracy of 92.9% and a center offset close to 5.4 pixels.
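A minimal constant-velocity Kalman filter for tracking a lane's lateral position illustrates the tracking stage; the state model and noise levels are illustrative assumptions, and the paper's adaptive variant additionally exploits the lane-width constraint.

```python
import numpy as np

class LaneKalman:
    """Sketch: track a lane's lateral position with a constant-velocity
    Kalman filter. The process/measurement noise levels are illustrative."""
    def __init__(self, q=1e-3, r=1e-1):
        self.x = np.zeros(2)                          # [position, velocity]
        self.P = np.eye(2)
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition
        self.H = np.array([[1.0, 0.0]])               # we observe position only
        self.Q, self.R = q * np.eye(2), np.array([[r]])

    def step(self, z: float) -> float:
        self.x = self.F @ self.x                      # predict
        self.P = self.F @ self.P @ self.F.T + self.Q
        S = self.H @ self.P @ self.H.T + self.R       # update
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.array([z]) - self.H @ self.x)
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return float(self.x[0])                       # filtered lane position
```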
Urban traffic environments pose significant challenges for automated vehicle detection, including fluctuating lighting, adverse weather, and complex road conditions. Visibility issues from fog, rain, and low light, alongside the prevalence of small vehicles in dense traffic, hinder detection accuracy. This study proposes an enhanced YOLOv5-based model for improved vehicle detection in complex urban traffic scenarios. Key contributions include integrating BiFPN for robust multi-scale feature fusion, adding an FFA module to boost detection under low-visibility conditions, and incorporating Image Adjustment Techniques (IAT) for preprocessing. Additionally, select YOLOv5 modules were upgraded to YOLOv8 components, yielding notable performance gains over the baseline model.
This paper proposes a stochastic human action prediction method based on a denoising diffusion probabilistic model, addressing the shortcomings of current motion prediction methods, such as insufficient diversity and prediction results that deviate from plausible motion intervals. Firstly, a spatio-temporal Transformer denoising diffusion prediction network is constructed to effectively capture the local relationships between 3D joints in each frame, improving the consistency between predicted actions and historical action sequences. Secondly, a graph convolutional network (GCN) and a GCN loss are introduced in the discrete cosine transform space to build a predicted-sequence refinement module, which refines the predictions, improves their accuracy, and further mitigates problems such as prediction lag and discontinuity. Finally, the proposed method was evaluated on the benchmark dataset, and the results show that it is significantly superior to existing prediction methods in terms of accuracy and fidelity.
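The discrete cosine transform space in which the refinement module operates can be sketched as follows: a joint trajectory is compacted into a few low-frequency coefficients and reconstructed smoothly. The truncation length is illustrative.

```python
import numpy as np
from scipy.fftpack import dct, idct

# Sketch: move a joint trajectory into DCT space, keep the first few
# coefficients, and transform back. This smooth, compact representation is
# the kind of space in which the GCN refinement described above operates.
trajectory = np.cumsum(np.random.randn(50))    # toy 1-D joint track
coeffs = dct(trajectory, norm='ortho')
coeffs[10:] = 0.0                              # low-frequency truncation
smoothed = idct(coeffs, norm='ortho')          # refined trajectory
```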
Facial expression recognition is an important application field in computer vision; by recognizing and analyzing the expressions on human faces, it can be widely used in many fields such as affective computing and intelligent security monitoring. China's population is aging rapidly, and real-time monitoring of the safety of the left-behind elderly has become crucial. In this paper, RGS-YOLOv9, a facial expression recognition algorithm improving on YOLOv9, is proposed by replacing the original RepNCSPELAN4 module of YOLOv9 with a RepNCSPELAN4_UIB module and introducing an SDI module as a connection. The experimental results show that mAP@0.5 on the Human Face Expression dataset improves by 3.3% and mAP@0.5:0.95 improves by 2.6%. This work provides new technical ideas for intelligent monitoring to improve the safety and security of the elderly.
Aiming at the challenges of smaller board defects that are hard to detect and insufficient feature extraction, we introduce a detection technique built on YOLOv8. Dynamic snake convolution is integrated into the C2f module, resulting in the C2f-DSConv module, which reduces the model's computational cost, and a SimAM attention mechanism is added to the neck network. Empirical results show that the improved YOLOv8n model raises mAP@0.5-0.95 to 69.1% on the PCB dataset, 4.2% better than the benchmark model. The FPS reaches 49.3, enhancing detection performance while preserving the speed of the baseline model.
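SimAM itself is parameter-free and compact enough to reproduce from its published energy formulation:

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free SimAM attention: neurons that deviate most from their
    channel mean get the highest weights, per the inverse-energy formula."""
    def __init__(self, eps: float = 1e-4):
        super().__init__()
        self.eps = eps

    def forward(self, x):
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n    # per-channel variance
        e_inv = d / (4 * (v + self.eps)) + 0.5     # inverse energy
        return x * torch.sigmoid(e_inv)
```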
Many skin diseases have the potential to evolve into skin cancer; hence, early diagnosis is crucial to prevent skin diseases from developing and spreading. Although current medical technologies based on lasers and photons can aid diagnosis, they are costly and time-consuming. The RGT-YOLOv9 algorithm is proposed in this study for the detection and diagnosis of skin diseases. The algorithm introduces the R4Ghost module to replace YOLOv9's original RepNCSPELAN4 and incorporates the Triplet attention mechanism into the YOLOv9 network. The experimental results show that mAP@0.5 on the Kuchbhe Dataset improves by 7.3% and mAP@0.5:0.95 improves by 5.58%. The improved algorithm provides visual technology support for early detection and diagnosis of skin diseases.
Multi-phase contrast-enhanced computed tomography (CECT) has been shown to effectively refine segmentation accuracy for both tumors and organs. Nevertheless, enhancing the segmentation accuracy of the pancreas and pancreatic tumors using multi-phase CECT remains a challenge, and it is crucial for subsequent clinical analysis and treatment. Current methods either simply concatenate multi-phase features along the channel direction or use attention mechanisms to compute a weighted sum of features across phases, neglecting the contribution of phase-specific difference features to segmentation accuracy. In view of this, this paper proposes a multi-phase segmentation method for the pancreas and pancreatic tumors that improves segmentation accuracy by enhancing phase-specific difference features. Additionally, we propose a training strategy that improves the model's segmentation accuracy when only single-phase CECT is available, achieved by distilling the knowledge of a segmentation model trained on multi-phase CECT. The proposed method and training strategy proved effective in experiments on a privately collected multi-phase CECT dataset of pancreatic tumors.
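The distillation step can be sketched with the standard softened-logits KL objective; the temperature and reduction are conventional assumptions rather than the paper's exact recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    """Sketch of knowledge distillation: the single-phase student matches the
    softened per-pixel class predictions of the multi-phase teacher via KL
    divergence. Logits: (B, C, H, W); T is the softening temperature."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * T * T
```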
Skin diseases are among the most common diseases in the world, and deep learning methods such as convolutional neural networks (CNNs) have significantly improved skin cancer recognition. These methods usually extract and classify features from dermoscopy and clinical images. However, existing methods do not take full advantage of the correlation between modalities, and a single fusion step may lose feature information. Our proposed ASAF-Net enables multi-modal information sharing, guiding and coordinating data from different modalities during feature extraction. It also uses a multi-stage feature fusion strategy to integrate multimodal information more effectively in the feature extraction and prediction stages. Our approach was evaluated on the multimodal, multi-label dermatology dataset Seven-Point Checklist, where ASAF-Net achieved average accuracies of 79.1% and 87.5% for the multiple classification tasks and diagnosis, respectively, exceeding other state-of-the-art methods.
Target detection and recognition are crucial research areas in computer vision, focusing on the identification of specific objects or targets within images or videos. This technology is vital for numerous real-world applications. In the engineering sector, Laser Spot Detection and Recognition Technology significantly enhances both productivity and accuracy. This research introduces a novel method for detecting and recognizing structured light array spots during calibration. The proposed technique leverages an adaptive threshold segmentation algorithm for image segmentation. It utilizes the energy distribution and positional relationships of the arrayed spots to accurately localize the spot array. Furthermore, it iteratively refines the position and distribution pattern of the spots using a planar projective relationship. This method can detect and identify regular arrays of light spots on calibration planes with varying reflectance and surface absorptivity, achieving detection accuracy up to 98% and localization accuracy at the sub-pixel level.
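A sketch of the adaptive-threshold segmentation step using OpenCV; the block size, offset, and file name are illustrative assumptions.

```python
import cv2

# Sketch: adaptive thresholding computes a local threshold per neighborhood,
# so bright spots stay separable on calibration planes whose reflectance
# varies across the surface. 'spot_array.png' is a placeholder input.
gray = cv2.imread('spot_array.png', cv2.IMREAD_GRAYSCALE)

# Negative offset C raises the local threshold, keeping only pixels that are
# clearly brighter than their neighborhood (the spot centers).
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, -5)
```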
This paper introduces a natural gas visualization and leak monitoring system based on spectral video technology. The system combines multiple modalities, including spectral video imaging, laser scanning, and visible light camera, to achieve efficient detection and precise localization of gas leaks. The system can detect, quantify, and report gas leaks in real-time. Results indicate that the system offers high detection sensitivity and accuracy in detecting leaks within complex environments, providing reliable assurance for the safe operation of chemical plants.
Aiming at the insufficient accuracy of existing robot position measurement, this paper studies a binocular vision method for measuring the robot end position based on orthogonal iteration to suppress covariance error. Firstly, a binocular measurement system is built, a binocular measurement model is derived, and the model is calibrated. Secondly, a cooperative marker for measuring the robot position is designed, and the coordinates of the marker's feature points in image space are extracted using sub-pixel straight-line fitting. The feature points are then passed through the binocular measurement model to compute their spatial positions in the camera coordinate system, and orthogonal iteration is used to suppress the conversion error and improve the solution accuracy. Finally, the position of the robot end is solved. Taking a six-degree-of-freedom robot as the experimental object, end-position measurement experiments were carried out; the results show that the proposed method reduces the average error of the robot's end position from 0.165 mm to 0.123 mm, and the average attitude errors about the X, Y, and Z directions are 0.135°, 0.194°, and 0.173°.