Transformer-based monocular 3D object detection methods have progressed significantly in recent years. However, most existing methods struggle to handle fine-grained objects and complex scenes effectively, particularly when capturing the features of occluded or small objects. To tackle these issues, we propose CU-DETR, a monocular 3D object detector built on the MonoDETR framework. CU-DETR introduces a local-global fusion encoder to enhance local feature extraction and fusion, and applies an uncertainty perturbation strategy in position encoding to improve the model's performance in complex scenes. Experimental results on the public KITTI dataset demonstrate that CU-DETR outperforms MonoDETR.
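As a rough illustration of how an uncertainty perturbation on positional encodings might be realized, the sketch below adds Gaussian noise to a learned positional embedding at training time. The abstract does not specify the exact formulation, so the module name, the learned embedding, and the scale `sigma` are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class UncertainPositionalEncoding(nn.Module):
    """Learned positional encoding with Gaussian perturbation at train time.

    Illustrative sketch only: the perturbation scale `sigma` and the use
    of a learned embedding are assumptions, not details from the paper.
    """
    def __init__(self, num_tokens: int, dim: int, sigma: float = 0.1):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        self.sigma = sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pos = self.pos_embed
        if self.training:
            # Perturb the encoding so the model must tolerate positional noise.
            pos = pos + self.sigma * torch.randn_like(pos)
        return x + pos
```

At inference the unperturbed embedding is used, so the perturbation acts only as a train-time regularizer.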
CNN-Transformer hybrid models, which combine the strength of Transformers in capturing global context with that of CNNs in local feature extraction, have become an appealing direction in visual perception. However, hybrid models still face the significant challenge of minimizing computational expense while balancing throughput and accuracy. This paper proposes HTViT, an efficient CNN-Transformer hybrid model that improves throughput and memory consumption while maintaining high accuracy. Built on the three-stage architecture of LeViT, HTViT introduces a sparse cascaded group attention mechanism and global-local downsampling modules. The sparse cascaded group attention mechanism compresses the keys and values in each group attention via local aggregation to improve throughput and memory consumption. The global-local downsampling module introduces multi-scale convolutional downsampling to enhance local features and retain more valuable information, improving model performance. Comparison experiments with state-of-the-art efficient hybrid models are conducted on the CIFAR-10, STL-10, and Imagenette datasets. The experimental results demonstrate that HTViT significantly outperforms the baseline LeViT and achieves a better balance of model size, throughput, memory consumption, and accuracy than other hybrid models.
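The key/value compression by local aggregation can be sketched as below: a strided depthwise convolution pools neighbouring tokens before attention, shrinking the K/V sequence length and hence attention memory and compute. This is a minimal single-head sketch under assumed choices (stride 2, depthwise pooling); the paper's cascaded multi-group design is not reproduced here.

```python
import torch
import torch.nn as nn

class CompressedKVAttention(nn.Module):
    """Single attention head whose keys/values are spatially compressed.

    Minimal sketch of key/value local aggregation: a strided depthwise
    convolution pools neighbouring tokens before attention. Stride and
    layer choices are assumptions for illustration.
    """
    def __init__(self, dim: int, stride: int = 2):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # Local aggregation: depthwise conv over the 2D token grid.
        self.pool = nn.Conv2d(dim, dim, kernel_size=stride, stride=stride, groups=dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, d = x.shape                                    # n == h * w
        q = self.q(x)                                        # (b, n, d)
        grid = x.transpose(1, 2).reshape(b, d, h, w)
        pooled = self.pool(grid).flatten(2).transpose(1, 2)  # (b, n', d), n' < n
        k, v = self.kv(pooled).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale        # (b, n, n')
        return attn.softmax(dim=-1) @ v                      # (b, n, d)
```

With stride 2 the attention matrix shrinks by 4x, which is where the throughput and memory gains come from.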
Waste classification based on deep neural networks suffers from dataset deficiency, because collecting and labeling waste samples is expensive and time-consuming. We propose an improved ResNet-18 model based on Model-Agnostic Meta-Learning (MAML) to improve classification accuracy on a few-shot waste classification dataset. The feature extraction part of the improved model consists of a convolution layer and four residual blocks; the classification part consists of a max-pooling layer and three fully connected layers. Moreover, GroupNorm is adopted to reduce the impact of normalizing differing feature distributions on classification accuracy. Initialized with parameters from MAML training on the Mini-ImageNet dataset, the model improves accuracy with only one training iteration on a few waste samples. Experiments verify the effectiveness of our model on the Mini-ImageNet dataset and a few-shot waste classification dataset.
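A minimal sketch of the normalization swap follows: a standard ResNet-18 basic block with GroupNorm in place of BatchNorm. The group count (8 here) and block layout are assumptions following common ResNet practice, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class GNResidualBlock(nn.Module):
    """ResNet basic block using GroupNorm instead of BatchNorm.

    Sketch only: the group count (assumed 8; must divide the channel
    count) and layout follow standard ResNet-18 basic blocks.
    """
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, groups: int = 8):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.gn1 = nn.GroupNorm(groups, out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.gn2 = nn.GroupNorm(groups, out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Project the shortcut when the shape changes.
        self.shortcut = (
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                          nn.GroupNorm(groups, out_ch))
            if stride != 1 or in_ch != out_ch else nn.Identity()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.gn1(self.conv1(x)))
        out = self.gn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))
```

GroupNorm normalizes within each sample rather than across the batch, which keeps its statistics stable under the tiny, shifting batches typical of few-shot meta-learning.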
In cities, the large amount of municipal solid waste has a significant impact on the ecological environment. Automatic and robust waste detection and classification is a promising yet challenging problem in urban solid waste disposal. The performance of classical detection and classification methods is degraded by factors such as occlusion and scale differences. To enhance the detection model's robustness to occlusion and small items, we propose a robust waste detection method based on a cascade adversarial spatial dropout detection network (Cascade ASDDN). Hard examples with occlusion in the pyramid feature space are generated and used for adversarial training of the detection network; they are produced by a spatial dropout module guided by Gradient-weighted Class Activation Mapping (Grad-CAM). Experiments verify the effectiveness of our method on the 2020 Haihua AI Challenge waste classification dataset.
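The core idea of Grad-CAM-guided spatial dropout can be sketched as follows: mask the spatial positions the class activation map rates most discriminative, so the detector must learn from the remaining evidence, emulating occlusion. The drop ratio and the function interface are assumed for illustration.

```python
import torch

def cam_spatial_dropout(features: torch.Tensor,
                        cam: torch.Tensor,
                        drop_ratio: float = 0.1) -> torch.Tensor:
    """Zero out the feature-map locations Grad-CAM rates most important.

    Hedged sketch of adversarial spatial dropout: `features` is
    (B, C, H, W), `cam` is a (B, H, W) Grad-CAM importance map, and
    `drop_ratio` (an assumed hyperparameter) sets the fraction of
    spatial positions masked to synthesize occluded hard examples.
    """
    b, c, h, w = features.shape
    k = max(1, int(drop_ratio * h * w))
    flat = cam.reshape(b, -1)
    # Indices of the top-k most discriminative positions per image.
    topk = flat.topk(k, dim=1).indices
    mask = torch.ones_like(flat)
    mask.scatter_(1, topk, 0.0)
    return features * mask.reshape(b, 1, h, w)
```

Applying the mask at each pyramid level yields the occluded hard examples used for adversarial training.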
Low-light object detection is a challenging problem in computer vision and multimedia. Most available object detection methods are not sufficiently accurate in low-light conditions. The main idea of low-light object detection is to add an image enhancement preprocessing module before the detection network. However, traditional image enhancement algorithms may cause color loss, and recent deep learning methods tend to consume too many computing resources, making them unsuitable for low-light object detection. We propose an accurate low-light object detection method based on pyramid networks. A low-resolution pyramid light enhancement network is adopted to reduce computing and memory consumption, and a super-resolution network based on an attention mechanism is placed before EfficientDet to improve detection accuracy. Experiments on the 10K RAW-RGB low-light image dataset show the effectiveness of the proposed method.
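Structurally, the pipeline can be sketched as three stages: enhance a downsampled copy of the image, recover resolution with the attention super-resolution network, then detect. The `enhancer`, `sr_net`, and `detector` modules and the downscale factor below are placeholders, not the paper's components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowLightDetectionPipeline(nn.Module):
    """Enhance at low resolution, upsample with attention SR, then detect.

    Structural sketch only: `enhancer`, `sr_net`, and `detector` stand in
    for the paper's pyramid enhancement network, attention-based
    super-resolution network, and EfficientDet, respectively.
    """
    def __init__(self, enhancer: nn.Module, sr_net: nn.Module,
                 detector: nn.Module, down_scale: int = 4):
        super().__init__()
        self.enhancer = enhancer
        self.sr_net = sr_net
        self.detector = detector
        self.down_scale = down_scale

    def forward(self, image: torch.Tensor):
        # Enhance a downsampled copy to save compute and memory.
        low = F.interpolate(image, scale_factor=1 / self.down_scale,
                            mode='bilinear', align_corners=False)
        enhanced = self.enhancer(low)
        # Recover resolution with the attention SR network, then detect.
        restored = self.sr_net(enhanced)
        return self.detector(restored)
```

Enhancing at 1/4 resolution cuts the enhancement cost roughly 16x, shifting the budget to the lighter super-resolution step.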
Recently, convolutional neural networks (CNNs) have been widely used in object detection and image recognition for their effectiveness. Many highly accurate CNN classification models have been developed for various machine learning applications, but they are generally computationally costly and require a hardware platform with substantial computing power and memory. To achieve accurate and efficient object detection with CNNs on resource-limited systems such as mobile devices, we propose a lightweight DenseNet variant called Lite Asymmetric DenseNet (LA-DenseNet). To compress model complexity, we replace the 7 x 7 convolution and 3 x 3 max-pooling in the initial downsampling stage with multiple 3 x 3 convolutions and a 2 x 2 max-pooling, significantly reducing the computing cost. In the design of the dense blocks, channel splitting and channel shuffling are employed to enhance the information exchange between feature maps and improve the expressive ability of the network. We decompose the 3 x 3 convolution in the dense block into a combination of 3 x 1 and 1 x 3 convolutions, which speeds up computation and extracts more spatial features through asymmetric convolution. To evaluate the proposed approach, we develop an experimental system in which LA-DenseNet extracts features and the Single Shot MultiBox Detector (SSD) detects objects. With VOC2007+12 as the training and testing datasets, our model achieves detection accuracy comparable to YOLOv2 at a fraction of its computational cost and memory usage.
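The asymmetric decomposition is concrete enough to sketch directly: two stacked asymmetric kernels cover the same 3 x 3 receptive field with 6 weights per channel pair (3x1 + 1x3) instead of 9. The module below is an illustrative stand-in, not the authors' exact layer.

```python
import torch.nn as nn

class AsymmetricConv(nn.Module):
    """Replace a 3x3 convolution with stacked 3x1 and 1x3 convolutions.

    Sketch of the decomposition described in the abstract: the pair
    covers a 3x3 receptive field with 6/9 of the weights per channel
    pair (3*1 + 1*3 = 6 vs. 3*3 = 9).
    """
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1), padding=(1, 0), bias=False),
            nn.Conv2d(out_ch, out_ch, kernel_size=(1, 3), padding=(0, 1), bias=False),
        )

    def forward(self, x):
        return self.conv(x)
```

The output spatial size matches a padded 3 x 3 convolution, so the module is a drop-in replacement inside a dense block.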
Developments in generative adversarial networks (GANs) have made it possible to fill missing regions of broken images with convincing details. However, many existing approaches fail to keep the inpainted content and structure consistent with the surroundings. In this paper, we propose a GAN-based inpainting model that restores semantically damaged images in a visually reasonable and coherent way. In our model, the generative network is an autoencoder and the discriminator network is a CNN classifier. Unlike the classic autoencoder, we design a novel bottleneck layer in the middle of the autoencoder comprising four dense-net blocks, each containing vanilla convolution layers and dilated convolution layers. The kernels of the dilated convolutions are spread out, effectively enlarging the receptive field, so the model can capture broader semantic information and ensure the consistency of inpainted images. Furthermore, reusing features from different levels in each dense-net block helps the model understand the whole image better and produce a convincing result. We evaluate our model on the public CelebA and Stanford Cars datasets with randomly positioned masks of different ratios. The effectiveness of our model is verified by qualitative and quantitative experiments.
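A dense block mixing vanilla and dilated convolutions might look like the sketch below: each layer sees the concatenation of all earlier features (dense connectivity) while growing dilation rates widen the receptive field so distant context can inform the hole region. The layer count, growth rate, and dilation schedule here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DilatedDenseBlock(nn.Module):
    """Dense block mixing vanilla and dilated 3x3 convolutions.

    Sketch of the bottleneck design outlined in the abstract; the
    dilation schedule (1, 2, 4, 8) and growth rate are assumptions.
    """
    def __init__(self, channels: int, growth: int = 32,
                 dilations=(1, 2, 4, 8)):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = channels
        for d in dilations:
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, 3, padding=d, dilation=d),
                nn.ReLU(inplace=True),
            ))
            ch += growth  # dense concatenation grows the channel count

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for layer in self.layers:
            # Each layer consumes all previously produced feature maps.
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)
```

With dilation 8 the last layer's effective kernel spans 17 pixels, which is how a small stack reaches the wide context inpainting needs.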
Sequentially capturing four-dimensional light field data with a coded aperture camera is an effective approach but suffers from a low signal-to-noise ratio. Although multiplexing can help raise acquisition quality, noise remains a major issue, especially for fast acquisition. To address this problem, this paper proposes a noise-robust light field reconstruction method. First, a scene-dependent noise model is studied and incorporated into the light field reconstruction framework. Then, we derive an optimization algorithm for the final reconstruction. We build a prototype by hacking an off-the-shelf camera for data capture and prove the concept. The effectiveness of the method is validated with experiments on real captured data.
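For orientation, a generic coded-aperture reconstruction with a signal-dependent noise term can be written as below; this is a textbook-style sketch, and the paper's exact noise model, weighting, and regularizer are not reproduced here.

```latex
% Generic coded-aperture light field measurement and reconstruction model
% (illustrative; symbols are assumptions, not the paper's notation).
\begin{align}
  \mathbf{y} &= \mathbf{A}\,\mathbf{x} + \mathbf{n}, \qquad
  \mathbf{n} \sim \mathcal{N}\!\left(\mathbf{0},\, \sigma^{2}(\mathbf{A}\mathbf{x})\right) \\
  \hat{\mathbf{x}} &= \arg\min_{\mathbf{x}}
  \left\| \mathbf{W}\,(\mathbf{y} - \mathbf{A}\mathbf{x}) \right\|_{2}^{2}
  + \lambda\, \Phi(\mathbf{x})
\end{align}
```

Here \(\mathbf{y}\) stacks the multiplexed measurements, \(\mathbf{A}\) encodes the aperture patterns, the noise variance depends on the signal (making the noise scene-dependent), \(\mathbf{W}\) whitens the residual according to that noise model, and \(\Phi\) is a light field prior weighted by \(\lambda\).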