Adversarial patches in computer vision can be used to fool deep neural networks and manipulate their decision-making process. One of the most prominent examples of adversarial patches are evasion attacks on object detectors. By covering parts of objects of interest, these patches suppress the detections and thus make the target objects "invisible" to the object detector. Since these patches are usually optimized for a specific network on a specific training dataset, transferability across other networks and datasets is not guaranteed. This paper addresses these issues and investigates transferability across numerous object detector architectures. Our extensive evaluation across various models on two distinct datasets indicates that patches optimized with larger models transfer better across networks than patches optimized with smaller models.
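A minimal sketch of how such an evasion patch could be optimized by gradient descent against a detector's objectness scores. The `detector` model and the `paste_patch` overlay helper are hypothetical stand-ins for the components used in the paper, not its actual implementation:

```python
# Sketch of adversarial patch optimization for detector evasion.
# `detector` (returns per-candidate objectness scores) and `paste_patch`
# (overlays the patch onto the target objects) are hypothetical helpers.
import torch

def optimize_evasion_patch(detector, images, patch_size=64, steps=250, lr=0.03):
    patch = torch.rand(3, patch_size, patch_size, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        patched = paste_patch(images, patch.clamp(0, 1))  # overlay on objects
        scores = detector(patched)                        # objectness scores
        loss = scores.max(dim=-1).values.mean()           # suppress detections
        opt.zero_grad()
        loss.backward()
        opt.step()
    return patch.detach().clamp(0, 1)
```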
This work examines the extent to which training data can be artificially generated for a target domain in an unsupervised manner, in order to train an object detector in that domain when little or no real training data is available. If the distributions of a source and a target domain differ but the same task is performed on both, this is referred to as domain adaptation. In image processing, generative approaches are often used to transform the distribution of the source domain into that of the target domain. In this work, a generative method, a Denoising Diffusion Probabilistic Model (DDPM), is investigated for domain adaptation from the visible spectrum (VIS) to the thermal infrared (IR). Systematic extensions, such as the use of alternative noise schedules, were incorporated and evaluated; the intermediate results of the domain adaptation are significantly improved by these extensions. In a subsequent step, a thermal infrared object detector is trained on the results of the domain adaptation. The publicly available Multi-Scenario Multi-Modality Benchmark to Fuse Infrared and Visible (M3FD) and recordings from the sensor vehicle MODISSA are used here for evaluation.
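To illustrate what an "alternative noise schedule" for a DDPM can look like, here is a sketch comparing the standard linear beta schedule with the cosine schedule of Nichol and Dhariwal; this is one plausible alternative in the sense of the abstract, not necessarily the paper's exact choice:

```python
# Two DDPM noise schedules: linear betas and the cosine schedule
# (Nichol & Dhariwal, 2021). The beta ratio is invariant to the
# normalization of alpha_bar, so the simplified form below suffices.
import numpy as np

def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    return np.linspace(beta_1, beta_T, T)

def cosine_betas(T=1000, s=0.008):
    t = np.arange(T + 1) / T
    alpha_bar = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0, 0.999)
```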
Anchor-free object detectors commonly output center points. Starting from this initial representation, a regression scheme determines a target point set that captures object properties such as enclosing bounding boxes and further attributes such as class labels.
When trained only for the detection task, the encoded center point feature representations are not well suited for tracking objects, since the embedded features are not stable over time.
To tackle this problem, we present an approach for joint detection and feature embedding for multiple object tracking. The proposed approach applies an anchor-free detection model to pairs of images to extract single-point feature representations. To generate temporally stable features suitable for track association across short time intervals, auxiliary losses are applied that reduce the distance between tracked identities in the embedded feature space.
The abilities of the presented approach are demonstrated on real-world data reflecting prototypical object tracking scenarios.
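A minimal sketch of the kind of auxiliary embedding loss described above: embeddings of the same track identity in a frame pair are pulled together and different identities pushed apart. This margin-based formulation is an illustrative assumption; the paper's exact loss may differ:

```python
# Auxiliary embedding loss over a frame pair (t, t+1).
# Assumes both matched and unmatched identity pairs occur in the batch.
import torch
import torch.nn.functional as F

def pairwise_embedding_loss(emb_t, emb_t1, ids_t, ids_t1, margin=0.5):
    # emb_*: (N, D) L2-normalized center-point embeddings; ids_*: (N,) track ids
    sim = emb_t @ emb_t1.T                           # cosine similarities
    same = ids_t.unsqueeze(1) == ids_t1.unsqueeze(0)
    pos = (1 - sim)[same]                            # pull matched identities
    neg = F.relu(sim - margin)[~same]                # push apart non-matches
    return pos.mean() + neg.mean()
```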
Over the past years, tracking in the visible domain has seen rapid growth by exploiting Deep Neural Network (DNN) based methods, whereas tracking in the Thermal Infrared (TIR) domain has received little attention. In this comparative study, we address tracking in a TIR maritime context for surveillance applications. Towards this end, we first compare the performance of traditional Single Object Trackers (SOTs) and recent DNN-based SOTs on a TIR maritime dataset. Following this, we examine the sequences of the TIR dataset that cause difficulties for trackers and identify problematic attributes. We use one group of recent state-of-the-art DNN-based trackers and another group of traditional trackers not employing DNN-based methods, and measure performance using the following metrics: Intersection over Union (IoU), center error, success rate, and robustness. Furthermore, we rank the trackers by taking into account their scores on IoU and robustness. The presented study shows that recent trackers exploiting DNN-based methods perform better on average, by over 24% on IoU and over 14% on robustness, than their counterparts not utilizing DNNs in their tracking process. Moreover, despite the improvement provided by DNN-based trackers, a failure case analysis shows that clutter, occlusion handling, low resolution, and scale change of the target are visual attributes that remain challenging and require further improvement.
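For reference, the IoU metric used in the comparison measures the overlap between a tracker's predicted box and the ground-truth box; a minimal implementation for axis-aligned (x, y, w, h) boxes:

```python
# Intersection over Union for two axis-aligned boxes given as (x, y, w, h).
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```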
In this paper, we present a new approach for tracking non-rigid, deformable objects by merging an on-line boosting-based tracker with a fast foreground-background segmentation. We extend an on-line boosting-based tracker that uses axis-aligned bounding boxes with a fixed aspect ratio as tracking states. By constructing a confidence map from the on-line boosting-based tracker and unifying it with a confidence map obtained from a foreground-background segmentation algorithm, we build a superior confidence map. For constructing a rough confidence map of a new frame based on on-line boosting, we employ the responses of the strong classifier as well as the individual weak classifier responses built during the preceding update step. This confidence map provides a rough estimate of the object's position and dimensions. In order to refine it, we build a fine, pixel-wise segmented confidence map and merge both maps. Our segmentation method is color-histogram based and provides a fine and fast image segmentation. By means of back-projection and Bayes' rule, we obtain a confidence value for every pixel. The rough and the fine confidence maps are merged by forming an adaptively weighted sum of both maps, whose weights are derived from the variances of the two maps. Further, we apply morphological operators to the merged confidence map in order to reduce noise. In the resulting map, we estimate the object's location and dimensions via continuously adaptive mean shift. Our approach provides a rotated rectangle as tracking state, which enables a more precise description of non-rigid, deformable objects than axis-aligned bounding boxes. We evaluate our tracker on the visual object tracking (VOT) benchmark dataset 2016.
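A sketch of the variance-weighted map fusion step, under the assumption that the map with higher variance is treated as more discriminative and weighted up; the paper's exact weighting rule may differ:

```python
# Adaptively weighted fusion of the rough (boosting) and fine (segmentation)
# confidence maps. The weighting direction is an assumption for illustration.
import numpy as np

def fuse_confidence_maps(rough, fine, eps=1e-6):
    w_rough, w_fine = rough.var() + eps, fine.var() + eps
    return (w_rough * rough + w_fine * fine) / (w_rough + w_fine)
```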
This paper is a continuation of the work of Becker et al. In their work, they analyzed the robustness of various background subtraction algorithms on fused video streams originating from visible and infrared cameras. In order to cover a broader range of background subtraction applications, we show the effects of fusing infrared-visible video streams from vibrating cameras on a large set of background subtraction algorithms. The effectiveness is quantitatively analyzed on recorded data of a typical outdoor sequence with fine-grained and accurate image annotations. Thereby, we identify approaches that can benefit from fused sensor signals under camera jitter. Finally, conclusions are given on which fusion strategies should be preferred under such conditions.
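One simple fusion strategy of the kind evaluated here can be sketched as running a background subtractor per registered stream and combining the foreground masks; MOG2 stands in for the algorithms actually compared in the paper:

```python
# Per-sensor background subtraction with mask-level fusion.
# AND suppresses single-sensor noise (e.g. jitter artifacts); OR is more
# sensitive but noisier. Streams are assumed spatially registered.
import cv2

mog2_vis = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
mog2_ir = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def fused_foreground(frame_vis, frame_ir, conservative=True):
    fg_vis = mog2_vis.apply(frame_vis)
    fg_ir = mog2_ir.apply(frame_ir)
    combine = cv2.bitwise_and if conservative else cv2.bitwise_or
    return combine(fg_vis, fg_ir)
```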
Existing tracking methods vary strongly in their approach and therefore have different strengths and weaknesses. For example, a single tracking algorithm may be good at handling variations in illumination but cope poorly with deformation. Hence, their failures can occur at entirely different time intervals in the same sequence. One possible solution for overcoming the limitations of a single tracker and benefiting from individual strengths is to run a set of tracking algorithms in parallel and fuse their outputs. In general, however, tracking algorithms are not designed to receive feedback from a higher-level fusion strategy, or they require a high degree of integration between individual levels. Towards this end, we introduce a fusion strategy for online single object tracking that requires no knowledge about individual tracker characteristics. The key idea is to combine several independent and heterogeneous tracking approaches and to robustly identify an outlier subset based on the Median Absolute Deviation (MAD) measure. The MAD fusion strategy is very generic and only requires frame-based object bounding boxes as input; thus, it can work with arbitrary tracking algorithms. Furthermore, the MAD fusion strategy can also be applied to combine several instances of the same tracker into a more robust ensemble for tracking an object. The evaluation is done on publicly available datasets. With a set of heterogeneous, commonly used trackers, we show that the proposed MAD fusion strategy improves the tracking results in comparison to a classical combination of parallel trackers, and that the tracker ensemble helps to deal with the initialization uncertainty of a single tracker.
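A minimal sketch of MAD-based outlier rejection over parallel tracker outputs: boxes whose centers deviate from the per-frame median by more than a few scaled MADs are discarded, and the survivors are averaged. The averaging step is a simplifying assumption for illustration:

```python
# MAD-based fusion of per-frame bounding boxes from parallel trackers.
import numpy as np

def mad_fuse(boxes, k=2.5):
    # boxes: (N, 4) array of (x, y, w, h), one row per tracker
    boxes = np.asarray(boxes, dtype=float)
    centers = boxes[:, :2] + boxes[:, 2:] / 2
    med = np.median(centers, axis=0)
    mad = np.median(np.abs(centers - med), axis=0) + 1e-6
    dev = np.abs(centers - med) / (1.4826 * mad)   # scaled to ~std units
    inliers = (dev < k).all(axis=1)
    return boxes[inliers].mean(axis=0), inliers
```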
We are living in a world dependent on sophisticated technical infrastructure. Malicious manipulation of such critical infrastructure poses an enormous threat to all its users. Thus, running a critical infrastructure requires special attention to log planned maintenance and to detect suspicious events. Towards this end, we present a knowledge-based surveillance approach capable of logging visually observable events in such an environment. The video surveillance modules are based on appearance-based person detection, which is further used to modulate the outcome of generic processing steps such as change detection or skin detection. A relation between the expected scene behavior and the underlying basic video surveillance modules is established. It is shown that this combination already provides sufficient expressiveness to describe various everyday situations in indoor video surveillance. The whole approach is qualitatively and quantitatively evaluated on a prototypical scenario in a server room.
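One hypothetical realization of the "modulation" described above: changes inside a detected person's bounding box are attributed to the person rather than reported as scene events:

```python
# Suppress change-detection responses explained by detected persons.
import numpy as np

def modulate_change_mask(change_mask, person_boxes):
    # change_mask: (H, W) binary array; person_boxes: list of (x, y, w, h)
    scene_events = change_mask.copy()
    for x, y, w, h in person_boxes:
        scene_events[y:y + h, x:x + w] = 0   # explained by person motion
    return scene_events
```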
In recent years, the wide use of video surveillance systems has caused an enormous increase in the amount of data that has to be stored, monitored, and processed. As a consequence, it is crucial to support human operators with automated surveillance applications. Towards this end, an intelligent video analysis module for real-time alerting in case of abandoned objects in public spaces is proposed. The overall processing pipeline consists of two major parts. First, person motion is modeled using an Interacting Multiple Model (IMM) filter, which estimates the state of a person according to a finite-state, discrete-time Markov chain. Second, the location of persons that stay at a fixed position defines a region of interest, in which a nonparametric background model with dynamic per-pixel state variables identifies abandoned objects. When an abandoned object is detected, an alarm event is triggered. The effectiveness of the proposed system is evaluated on the PETS 2006 and i-Lids datasets, both reflecting prototypical surveillance scenarios.
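A sketch of the discrete part of such an IMM filter, assuming two illustrative modes ("moving" vs. "static"): mode probabilities evolve under a Markov transition matrix and are reweighted by per-model measurement likelihoods. The full IMM also mixes the per-model Kalman states, which is omitted here:

```python
# IMM mode-probability update for a two-mode (moving/static) Markov chain.
import numpy as np

TRANS = np.array([[0.95, 0.05],    # moving -> moving / static
                  [0.05, 0.95]])   # static -> moving / static

def imm_mode_update(mu, likelihoods):
    # mu: current mode probabilities; likelihoods: p(z | model) per model
    predicted = TRANS.T @ mu               # Markov chain prediction
    posterior = predicted * likelihoods    # Bayes reweighting
    return posterior / posterior.sum()
```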
Aggregating pixel-based motion detection into regions of interest that contain views of single moving objects in a scene is an essential pre-processing step in many vision systems. Motion events of this type provide significant information about the object type or build the basis for action recognition. Furthermore, motion is an essential saliency measure that can effectively support high-level image analysis. For static cameras, background subtraction methods achieve good results; motion aggregation on freely moving cameras, on the other hand, is still a widely unsolved problem. The image flow measured by a freely moving camera results from two major motion types: first, the ego-motion of the camera, and second, object motion that is independent of the camera motion. When capturing a scene with such a camera, these two motion types are adversely blended together.
In this paper, we propose an approach to detect multiple moving objects with a mobile monocular camera system in an outdoor environment. The overall processing pipeline contains a fast ego-motion compensation algorithm in the preprocessing stage. Real-time performance is achieved by using a sparse optical flow algorithm as an initial processing stage and a densely applied probabilistic filter in the post-processing stage. Thereby, we follow the idea proposed by Jung and Sukhatme: normalized intensity differences originating from a sequence of ego-motion compensated difference images represent the probability of moving objects. Noise and registration artifacts are filtered out using a Bayesian formulation. The resulting a posteriori distribution is located on image regions showing strong amplitudes in the difference image that are in accordance with the motion prediction. In order to effectively estimate the a posteriori distribution, a particle filter is used.
In addition to the fast ego-motion compensation, the main contribution of this paper is the design of the probabilistic filter for real-time detection and tracking of independently moving objects. The proposed approach introduces a competition scheme between particles in order to ensure improved multi-modality. Furthermore, the filter design helps to generate a particle distribution that remains homogeneous even in the presence of multiple targets showing non-rigid motion patterns. The effectiveness of the method is shown on exemplary outdoor sequences.
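A minimal sketch of one step of such a probabilistic filter: particles are propagated, weighted by the ego-motion compensated difference image, and resampled so the distribution concentrates on independently moving regions. The competition scheme between particles is omitted for brevity:

```python
# One predict-weight-resample step of a particle filter driven by a
# normalized, ego-motion compensated difference image.
import numpy as np

def particle_filter_step(particles, diff_img, motion_std=5.0, rng=np.random):
    # particles: (N, 2) pixel positions as (x, y); diff_img: (H, W) in [0, 1]
    particles = particles + rng.normal(0, motion_std, particles.shape)  # predict
    h, w = diff_img.shape
    xi = particles[:, 0].clip(0, w - 1).astype(int)
    yi = particles[:, 1].clip(0, h - 1).astype(int)
    weights = diff_img[yi, xi] + 1e-9          # likelihood from motion evidence
    weights /= weights.sum()
    idx = rng.choice(len(particles), len(particles), p=weights)  # resample
    return particles[idx]
```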
Methods for automated person detection and person tracking are essential core components in modern security and surveillance systems. Most state-of-the-art person detectors follow a statistical approach, where prototypical appearances of persons are learned from training samples with known class labels. Selecting appropriate learning samples has a significant impact on the quality of the generated person detectors. For example, training a classifier on a rigid body model using training samples with strong pose variations is in general not effective, irrespective of the classifier's capabilities. Generation of high-quality training data is, apart from performance issues, a very time-consuming process comprising a significant amount of manual work. Furthermore, due to inevitable limitations of freely available training data, the corresponding classifiers are not always transferable to a given sensor and are only applicable in a well-defined, narrow variety of scenes and camera setups. Semi-supervised learning methods are a commonly used alternative to supervised training, generally requiring only a few labeled samples. However, as a drawback, semi-supervised methods always include a generative component, which is known to be difficult to learn. Therefore, automated processes for generating training datasets for supervised methods are needed. Such approaches could either help to better adjust classifiers to the respective hardware or serve as a complement to existing datasets. Towards this end, this paper provides some insights into the quality requirements of automatically generated training data for supervised learning methods. Assuming a static camera, labels are generated based on motion detection by background subtraction, subject to weak constraints on the enclosing bounding box of the motion blobs. Since this labeling method consists of standard components, we illustrate its effectiveness by adapting a person detector to the cameras of a sensor network. By varying the training data while keeping the detection framework identical, we derive statements about the sample quality.
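A sketch of such an automatic labeling step built from standard components: foreground blobs from background subtraction become person labels if their bounding box satisfies weak geometric constraints. The thresholds are illustrative, not those of the paper:

```python
# Automatic label generation: background subtraction plus weak bounding
# box constraints (upright aspect ratio, minimum size).
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

def auto_labels(frame, min_area=500, min_ratio=1.5, max_ratio=4.0):
    fg = subtractor.apply(frame)
    _, fg = cv2.threshold(fg, 200, 255, cv2.THRESH_BINARY)  # drop shadow pixels
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, kernel)       # remove speckle
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    labels = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h >= min_area and min_ratio <= h / w <= max_ratio:
            labels.append((x, y, w, h))   # plausible upright person
    return labels
```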
IR sensors are mainly utilized in video surveillance systems to provide vision at nighttime and in diffuse lighting conditions. The dynamic range of IR sensors usually exceeds that of conventional display devices; hence, range compression, associated with a loss of information, is always required. Range compression methods can be divided into global methods, which are based on the intensity distribution, and local methods, which focus on smaller regions of interest. In contrast to local methods, global methods are computationally efficient. Nevertheless, global methods have the drawback that fine details can be suppressed by intensity changes at image locations unrelated to the object of interest. In order to overcome these restrictions, we propose a method to render IR images based on high-level object information. The overall processing pipeline consists of a contrast enhancement method, followed by object detection, and a range compression method that takes the location of objects into account. Here, we use pedestrians as an exemplary object category. The output of the detector is a rectangular bounding box centered at the person's location. Restricting range compression to the person's location allows displaying details on the person's surface that would most probably remain undetected using global range compression methods. The proposed combination of rendering with high-level information is intended to be integrated into a surveillance system to assist human operators. Towards this end, this paper provides some insights into the design of visualization tools.
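An illustrative sketch of detection-guided range compression: a global percentile stretch of the high-dynamic-range IR frame, with the stretch recomputed locally inside each detected person box so details on the person remain visible. This is one plausible realization of the pipeline, not the paper's exact method:

```python
# Global percentile-based range compression with a local re-stretch
# inside detected person boxes.
import numpy as np

def detection_guided_compression(ir16, person_boxes):
    # ir16: (H, W) uint16 IR frame; person_boxes: list of (x, y, w, h)
    lo, hi = np.percentile(ir16, (1, 99))
    out = np.clip((ir16 - lo) / max(hi - lo, 1) * 255, 0, 255)      # global
    for x, y, w, h in person_boxes:
        roi = ir16[y:y + h, x:x + w]
        lo_r, hi_r = roi.min(), roi.max()
        out[y:y + h, x:x + w] = np.clip(
            (roi - lo_r) / max(hi_r - lo_r, 1) * 255, 0, 255)       # local
    return out.astype(np.uint8)
```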
In addition to detecting and tracking persons via video surveillance in public spaces like airports and train stations, another important aspect of situation analysis is the appearance of objects in the periphery of a person. Not only from a military perspective, in certain environments an unidentified armed person can be an indicator of a potential threat. In order to become aware of an unidentified armed person and to initiate counteractive measures, the ability to identify persons carrying weapons is needed. In this paper, we present a classification approach that fits into an Implicit Shape Model (ISM) based person detection and is capable of differentiating between unarmed persons and persons in an aiming body posture. The approach relies on SIFT features and is thus completely independent of sensor-specific features that might only be perceivable in the visible spectrum. For person representation and detection, a generalized appearance codebook is used. Compared to a stand-alone person detection strategy with ISM, an additional training step is introduced that allows interpretation of a person hypothesis delivered by the ISM. During training, the codebook activations and positions of the participating features are stored for the desired classes, in this case persons in an aiming posture and unarmed persons. With the stored information, one can calculate weight factors for every feature participating in a person hypothesis in order to derive a specific classification model. The introduced model is validated on an infrared dataset showing persons in aiming and non-aiming body postures from different angles.
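A simplified sketch of the class-specific weighting step: each codebook entry receives a weight derived from how often it was activated by "aiming" versus "unarmed" training persons, and the features voting for a person hypothesis are aggregated with these weights. The log-ratio form and the threshold are illustrative assumptions:

```python
# Class-specific weight factors from per-class codebook activation counts,
# and a simple weighted-vote classification of a person hypothesis.
import numpy as np

def codebook_weights(activations_aiming, activations_unarmed, eps=1.0):
    # activations_*: (K,) activation counts per codebook entry and class
    a = np.asarray(activations_aiming, dtype=float) + eps    # Laplace smoothing
    u = np.asarray(activations_unarmed, dtype=float) + eps
    return np.log(a / u)          # >0 favours "aiming", <0 favours "unarmed"

def classify_hypothesis(active_entries, weights, threshold=0.0):
    # active_entries: indices of codebook entries voting for the hypothesis
    score = weights[active_entries].sum()
    return "aiming" if score > threshold else "unarmed"
```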