The human-object interaction (HOI) detection task refers to localizing humans, localizing objects, and predicting the interactions between each human-object pair. HOI is considered one of the fundamental steps toward truly understanding complex visual scenes. For detecting HOI, it is important to utilize relative spatial configurations and object semantics to find salient spatial regions of images that highlight the interactions between human-object pairs. This issue is addressed by GTNet, a novel self-attention-based guided transformer network. GTNet encodes this spatial contextual information in human and object visual features via self-attention and achieves state-of-the-art results on both the V-COCO and HICO-DET datasets. Code is available online.
In this paper, we present methods for scene understanding, localization, and classification of complex, visually heterogeneous objects in overhead imagery. Key features of this work include: determining boundaries of objects within large field-of-view images, classifying increasingly complex object classes through hierarchical descriptions, and exploiting automatically extracted hypotheses about the surrounding region to improve classification of a more localized region. Our system uses a principled probabilistic approach to classify increasingly large and complex regions, and then iteratively uses this automatically determined contextual information to reduce false alarms and misclassifications.
Error-correction codes of suitable redundancy are used to ensure perfect data recovery over noisy channels. For iterative decoding methods, the decoder must be initialized with proper confidence values, called log-likelihood ratios (LLRs), for all the embedding locations. If these LLRs are accurately initialized, the decoder converges at a lower redundancy factor, leading to a higher effective hiding rate. Here, we present an LLR allocation method based on the image statistics, the hiding parameters, and the noisy channel characteristics. This image-dependent LLR allocation scheme results in a higher data rate than using a constant LLR across all images. The data-hiding channel parameters are learned from the image histogram in the discrete cosine transform (DCT) domain using a linear regression framework. We also show how the effective data rate can be increased by suitably increasing the erasure rate at the decoder.
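As a concrete illustration (a minimal sketch, not the paper's regression-based allocator, whose parameters come from the DCT-domain histogram), the mapping from per-location channel estimates to decoder LLRs can look as follows; the function name and interface are hypothetical, assuming a binary symmetric channel model per embedding location.

```python
import numpy as np

def init_llrs(received_bits, error_prob, erasure_mask=None):
    """Initialize decoder LLRs from per-location channel estimates.

    For a binary symmetric channel with crossover probability p, a
    received bit r maps to LLR = (1 - 2r) * log((1 - p) / p); erased
    locations get LLR = 0, i.e. no confidence either way.
    """
    p = np.clip(error_prob, 1e-6, 0.5)            # keep magnitudes finite
    llr = (1 - 2 * received_bits) * np.log((1.0 - p) / p)
    if erasure_mask is not None:
        llr = np.where(erasure_mask, 0.0, llr)    # erasures carry no confidence
    return llr
```

Accurate per-location estimates of p concentrate the decoder's confidence where hiding is reliable, which is what allows the code redundancy to drop.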
Current image re-sampling detectors can reliably detect re-sampling in JPEG images only at a Quality Factor (QF) of 95 or higher. At lower QFs, periodic JPEG blocking artifacts interfere with the periodic patterns introduced by re-sampling. We add a controlled amount of noise to the image before the re-sampling detection step: adding noise suppresses the JPEG artifacts while the periodic patterns due to re-sampling are partially retained. JPEG images in the QF range 75-90 are considered. Gaussian or uniform noise at SNRs in the 24-28 dB range is added to the image, and the resulting images are passed to the re-sampling detector. The detector outputs are averaged to obtain a final output from which re-sampling can be detected even at lower QFs. We consider two re-sampling detectors: one proposed by Popescu and Farid [1], which works well on uncompressed and mildly compressed JPEG images, and the other by Gallagher [2], which is robust on JPEG images but can detect only scaled images. For multiple re-sampling operations (rotation, scaling, etc.), we show that the order of re-sampling matters: if the final operation is up-scaling, it can still be detected even at very low QFs.
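A minimal sketch of this noise-then-detect pipeline, assuming a `detector` callable that stands in for either re-sampling detector and returns a score map; noise realizations at a few SNR settings are averaged, and all names and defaults are illustrative.

```python
import numpy as np

def noisy_resampling_detect(image, detector, snr_db=(24, 26, 28), trials=5):
    """Add controlled noise to suppress JPEG blocking, then average the
    detector outputs so the re-sampling periodicity dominates."""
    img = image.astype(np.float64)
    sig_power = np.mean(img ** 2)
    outputs = []
    for snr in snr_db:
        noise_power = sig_power / (10.0 ** (snr / 10.0))
        for _ in range(trials):
            noisy = img + np.sqrt(noise_power) * np.random.randn(*img.shape)
            outputs.append(detector(noisy))
    return np.mean(outputs, axis=0)
```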
We present further extensions of yet another steganographic scheme (YASS), a method that embeds data in randomized locations so as to resist blind steganalysis. YASS is a JPEG steganographic technique that hides data in the discrete cosine transform (DCT) coefficients of randomly chosen image blocks. In this paper, we present a further study of YASS with the goal of improving the embedding rate. The two main improvements presented are: (i) a method that randomizes the quantization matrix used on the transform-domain coefficients, and (ii) an iterative hiding method that exploits the fact that the JPEG "attack" that causes errors in the hidden bits is actually known to the encoder. We show that using both these approaches, the embedding rate can be increased while maintaining the same level of undetectability as the original YASS scheme. Moreover, for the same embedding rate, the proposed steganographic schemes are less detectable than the popular matrix-embedding-based F5 scheme, as measured using the features proposed by Pevny and Fridrich for blind steganalysis.
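The iterative hiding idea in (ii) can be sketched as below, with hypothetical `embed`, `extract`, and `jpeg` callables standing in for the YASS embedder, decoder, and the anticipated JPEG re-compression; the published scheme works at the coefficient level, but the loop conveys the principle that an encoder that knows the attack can simulate it and re-embed until the bits survive.

```python
def iterative_embed(image, bits, embed, extract, jpeg, max_iters=10):
    """Re-embed until the hidden bits survive the known JPEG attack.

    `bits` is a list of 0/1 values; `embed`, `extract`, and `jpeg` are
    hypothetical stand-ins for the actual YASS components.
    """
    stego = embed(image, bits)
    for _ in range(max_iters):
        if extract(jpeg(stego)) == bits:      # all bits survive the attack
            break
        stego = embed(stego, bits)            # retry on the current stego image
    return stego
```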
In this paper we attempt to quantify the "active" steganographic capacity: the maximum rate at which data can be hidden, and correctly decoded, in a multimedia cover subject to noise/attack (hence "active"), perceptual distortion criteria, and statistical steganalysis. Though work has been done on the capacity of data hiding as well as on the rate of perfectly secure data hiding in noiseless channels, only very recently have all these constraints been considered together. In this work, we seek to provide practical estimates of steganographic capacity in natural images undergoing realistic attacks, using data hiding methods available today. We focus on the capacity of an image data-hiding channel characterized by the use of statistical restoration to satisfy the constraint of perfect security (under an i.i.d. assumption), as well as JPEG and JPEG-2000 attacks. Specifically, we provide experimental results for the statistically secure hiding capacity on a set of several hundred images, hiding in a pre-selected band of frequencies using the discrete cosine and wavelet transforms, where a perturbation of the quantized transform-domain terms by ±1 using the quantization index modulation (QIM) scheme is considered perceptually transparent. Statistical security is with respect to matching the marginal statistics of the quantized transform-domain terms.
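For concreteness, here is a minimal odd-even sketch of the ±1 perturbation: each quantized transform-domain coefficient carries one bit in its parity, so embedding changes a coefficient by at most one step. This is an illustrative QIM variant, not necessarily the exact embedder used in the experiments.

```python
import numpy as np

def qim_pm1_embed(coeffs, bits):
    """Force the parity of each quantized coefficient to match its bit,
    perturbing by +/-1 only when the parities disagree."""
    marked = coeffs.copy()
    for i, b in enumerate(bits):
        if marked[i] % 2 != b:
            marked[i] += 1 if marked[i] <= 0 else -1   # a +/-1 perturbation
    return marked

def qim_pm1_decode(coeffs, n_bits):
    return [int(c % 2) for c in coeffs[:n_bits]]       # parity recovers the bit
```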
A video "fingerprint" is a feature extracted from the video that should represent the video compactly, allowing faster search without compromising the retrieval accuracy. Here, we use a keyframe set to represent a video, motivated by the video summarization approach. We experiment with different features to represent each keyframe with the goal of identifying duplicate and similar videos. Various image processing operations like blurring, gamma correction, JPEG compression, and Gaussian noise addition are applied on the individual video frames to generate duplicate videos. Random and bursty frame drop errors of 20%, 40% and 60% (over the entire video) are also applied to create more noisy "duplicate" videos. The similar videos consist of videos with similar content but with varying camera angles, cuts, and idiosyncrasies that occur during successive retakes of a video. Among the feature sets used for comparison, for duplicate video detection, Compact Fourier-Mellin Transform (CFMT) performs the best while for similar video retrieval, Scale Invariant Feature Transform (SIFT) features are found to be better than comparable-dimension features. We also address the problem of retrieval of full-length videos with shorter-length clip queries. For identical feature size, CFMT performs the best for video retrieval.
We have investigated adaptive mechanisms for high-volume transform-domain data hiding in MPEG-2 video that can be tuned to sustain varying levels of compression attacks. The data is hidden in the uncompressed domain by scalar quantization index modulation (QIM) on a selected set of low-frequency discrete cosine transform (DCT) coefficients. We propose an adaptive hiding scheme where the embedding rate is varied according to the type of frame and the reference quantization parameter (decided according to the MPEG-2 rate control scheme) for that frame. For a 1.5 Mbps video at 25 frames/sec, we are able to embed almost 7500 bits/sec. The adaptive scheme also hides 20% more data and incurs significantly fewer frame errors (frames for which the embedded data is not fully recovered) than the non-adaptive scheme. Our embedding scheme incurs insertions and deletions at the decoder, which may cause de-synchronization and decoding failure. This problem is solved by the use of powerful turbo-like codes and erasures at the encoder. The channel capacity estimate gives an idea of the minimum code redundancy factor required for reliable decoding of the hidden data transmitted through the channel. To that end, we have modeled the MPEG-2 video channel using the transition probability matrices given by the data hiding procedure, from which we compute the (hiding-scheme-dependent) channel capacity.
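Given such a transition probability matrix, the channel capacity can be computed numerically; the sketch below uses the standard Blahut-Arimoto iteration, which is one reasonable way to carry out the computation (the exact numerical method is not specified above).

```python
import numpy as np

def channel_capacity(P, iters=200):
    """Capacity (bits per use) of a discrete memoryless channel with
    transition matrix P[x, y] = Pr(y | x), via Blahut-Arimoto."""
    p = np.full(P.shape[0], 1.0 / P.shape[0])       # input distribution, refined below
    for _ in range(iters):
        q = p[:, None] * P                          # joint Pr(x, y)
        q /= q.sum(axis=0, keepdims=True) + 1e-300  # posterior Pr(x | y)
        r = np.exp((P * np.log(q + 1e-300)).sum(axis=1))
        p = r / r.sum()                             # updated input distribution
    q = p[:, None] * P
    q /= q.sum(axis=0, keepdims=True) + 1e-300
    return float((p[:, None] * P * np.log2(q / p[:, None] + 1e-300)).sum())
```

As a sanity check, a binary symmetric channel with crossover probability 0.1 yields about 0.531 bits/use, matching the closed form 1 - H(0.1).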
Print-scan resilient data hiding finds important applications in document security and image copyright protection. In this paper, we build upon our previous work on print-scan resilient data hiding with the goal of providing a mathematical foundation for computing information-theoretic limits and guiding the design of more sophisticated hiding schemes that allow a higher volume of embedded data. A model for the print-scan process is proposed, which has three main components: (a) effects due to mild cropping, (b) colored high-frequency noise, and (c) non-linear effects. It can be shown that cropping introduces an unknown but smoothly varying phase shift in the image spectrum. A new hiding method called Differential Quantization Index Modulation (DQIM) is proposed, in which information is hidden in the phase spectrum of images by quantizing the difference in phase of adjacent frequency locations. The unknown phase shift cancels when the difference is taken. Using the proposed DQIM hiding in phase, we are able to survive the print-scan process with several hundred information bits hidden in the images.
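A minimal sketch of the DQIM idea along a 1-D scan of phases in the embedding band: each bit selects one of two interleaved lattices for the phase difference of adjacent frequency locations, so a smooth additive phase shift largely cancels in the difference. Phase wrapping and the 2-D band geometry are ignored here, and the step size is illustrative.

```python
import numpy as np

DELTA = np.pi / 8   # illustrative quantizer half-step

def dqim_embed(phase, bits, delta=DELTA):
    """Quantize successive phase differences onto a bit-dependent lattice."""
    out = phase.copy()
    for i, b in enumerate(bits, start=1):
        diff = out[i] - out[i - 1]
        dither = b * delta                      # bit selects the lattice
        q = np.round((diff - dither) / (2 * delta)) * (2 * delta) + dither
        out[i] = out[i - 1] + q                 # impose the quantized difference
    return out

def dqim_decode(phase, n_bits, delta=DELTA):
    """The nearest lattice point of each phase difference recovers the bit."""
    return [int(np.round((phase[i] - phase[i - 1]) / delta)) % 2
            for i in range(1, n_bits + 1)]
```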
In this paper we study steganalysis, the detection of hidden data. Specifically, we focus on detecting data hidden in grayscale images by spread spectrum hiding. To accomplish this, we use a statistical model of images and estimate the detectability of a few basic spread spectrum methods. To verify these findings, we create a tool to discriminate between natural "cover" images and "stego" images (containing hidden data) taken from a diverse database. Existing steganalysis schemes that exploit the spatial memory found in natural images are particularly effective. Motivated by this, we include inter-pixel dependencies in our model of image pixel probabilities and use an appropriate statistical measure of the security of a steganography system subject to optimal hypothesis testing. Using this analysis as a guide, we design a tool for detecting hiding by various spread spectrum methods. Depending on the method and the power of the hidden message, we correctly detect the presence of hidden data in about 95% of images.
This paper presents an overview of our recent work on managing image and video data. The first half of the paper describes a representation for the semantic spatial layout of video frames. In particular, Markov random fields are used to characterize the spatial arrangement of frame tiles that are labeled using support vector machine classifiers. The representation is shown to support similarity retrieval at the semantic level as demonstrated in a prototype video management system. The second half of the paper describes a method for efficiently computing nearest neighbor queries in high-dimensional feature spaces in a relevance feedback framework.
Image registration is an important operation in remote sensing applications that fundamentally involves the identification of many control points in the images. As manual identification of control points may be time-consuming and tedious, several automatic techniques have been developed. This paper describes a system for automatic registration and mosaicking of remote sensing images under development at the Division of Image Processing (National Institute for Space Research - INPE) and the Vision Lab (Electrical & Computer Engineering Department, UCSB). Three registration algorithms, which showed potential for multisensor or temporal image registration, have been implemented. The system is designed to accept different types of data and user-provided information that speed up the processing or help avoid mismatched control points. Based on a statistical procedure used to characterize good and bad registration, the user can stop or modify the parameters and continue the processing. Extensive algorithm tests have been performed by registering optical, radar, multi-sensor, and high-resolution images and video sequences. Furthermore, the system has been tested by remote sensing experts at INPE using full-scene Landsat, JERS-1, CBERS-1, and aerial images. An online demo system, which contains several examples that can be run in a web browser, is available.
A practical method for creating a high-dimensional index structure that adapts to the data distribution and scales well with the database size is presented. Typical media descriptors, such as texture features, are high dimensional and are not uniformly distributed in the feature space. The performance of many existing methods degrades if the data is not uniformly distributed; the proposed method offers an efficient solution to this problem. First, the data's marginal distribution along each dimension is characterized using a Gaussian mixture model. The parameters of this model are estimated using the well-known Expectation-Maximization (EM) method; they can also be estimated sequentially for on-line updating. Using the marginal distribution information, each of the data dimensions can be partitioned such that each bin contains approximately an equal number of objects. Experimental results on a real image texture data set are presented. Comparisons with existing techniques, such as the well-known VA-File, demonstrate a significant overall improvement.
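A sketch of the per-dimension partitioning step, assuming scikit-learn's GaussianMixture for the EM fit: each marginal's mixture CDF is inverted numerically, yielding bin edges at equal-probability quantiles so that every bin holds roughly the same number of objects. Grid resolution and component count are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def equal_population_edges(data, n_bins=8, n_components=3):
    """Per-dimension bin edges at equal-probability quantiles of a GMM fit."""
    targets = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]    # interior quantiles
    edges = []
    for d in range(data.shape[1]):
        x = data[:, d:d + 1]
        gmm = GaussianMixture(n_components=n_components).fit(x)
        grid = np.linspace(x.min(), x.max(), 10000)       # invert CDF numerically
        pdf = np.exp(gmm.score_samples(grid[:, None]))
        cdf = np.cumsum(pdf)
        cdf /= cdf[-1]
        edges.append(np.interp(targets, cdf, grid))
    return edges
```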
A new method for automatically determining control points for registration and fusion of multispectral images is presented. This method was motivated by the question of whether it is possible to computationally assess the quality of optical flow estimates at various points throughout the image without knowing the true flow field. Somewhat surprisingly, the answer is yes, and it is determined by the norm of the least squares operator associated with the (windowed) optical flow equations. This approach has several advantages. First, it shows the danger of using the condition number of the optical flow equations to measure the reliability of the computed flow. Second, the method isolates points in the image corresponding to maximum reliability. These points in turn can be used as control points for registration and fusion without actually computing the optical flow, and they require only a single frame for computation. Since this computation requires only a few operations per pixel, it is very fast. The control points are defined as the minima of the norm of the least squares operator and as such enjoy a great deal of invariance with respect to the regional intensity changes seen in multispectral images. For this reason they are ideal for multispectral registration. A multiscale version of this method has been developed that allows a coarse-to-fine control point decomposition for suppressing the negative effects of noise and clutter. Various applications are presented demonstrating the utility of this approach for real-world images, including multispectral satellite images and dual-spectral IR image sequences. For the latter, we were able to obtain subpixel motion estimates that were accurate to within one percent of the true motion.
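For a single frame, the score can be sketched as follows: the norm of the windowed least-squares operator is inversely related to the square root of the smallest eigenvalue of the local gradient structure tensor, so minima of the score mark reliable control points. Window size and gradient operator are illustrative choices here.

```python
import numpy as np
from scipy.ndimage import sobel, uniform_filter

def ls_operator_norm(frame, win=7):
    """Per-pixel norm of the windowed optical-flow least-squares operator,
    computed from the structure tensor; small values = good control points."""
    f = frame.astype(np.float64)
    ix, iy = sobel(f, axis=1), sobel(f, axis=0)
    jxx = uniform_filter(ix * ix, win)          # windowed A^T A entries
    jxy = uniform_filter(ix * iy, win)
    jyy = uniform_filter(iy * iy, win)
    tr, det = jxx + jyy, jxx * jyy - jxy ** 2
    lam_min = tr / 2 - np.sqrt(np.maximum((tr / 2) ** 2 - det, 0.0))
    return 1.0 / np.sqrt(lam_min + 1e-12)       # norm ~ 1 / sqrt(lambda_min)
```

Control points are then taken at local minima of the returned map, which requires only a few operations per pixel and a single frame, as noted above.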
A new technique for embedding image data that can be recovered in the absence of the original host image is presented. The data to be embedded, referred to as the signature data, is inserted into the host image in the DCT domain. The signature DCT coefficients are encoded using a lattice coding scheme before embedding. Each block of host DCT coefficients is first checked for its texture content, and the signature codes are inserted adaptively depending on a local texture measure. Experimental results indicate that high-quality embedding is possible, with no visible distortions. Signature images can be recovered even when the embedded data is subjected to significant lossy JPEG compression.
There is a growing need for new representations of video that allow not only compact storage of data but also content-based functionalities such as search and manipulation of objects. We present here a prototype system, called NeTra-V, that is currently being developed to address some of these content-related issues. The system has a two-stage video processing structure: a global feature extraction and clustering stage, and a local feature extraction and object-based representation stage. Key aspects of the system include a new spatio-temporal segmentation and object-tracking scheme, and a hierarchical object-based video representation model. The spatio-temporal segmentation scheme combines color/texture image segmentation and affine motion estimation techniques. Experimental results show that the proposed approach can handle large motion. The output of the segmentation, the alpha plane as it is referred to in MPEG-4 terminology, can be used to compute local image properties. This local information forms the low-level content description module in our video representation. Experimental results illustrating spatio-temporal segmentation and tracking are provided.
An approach to embedding gray-scale images using a discrete wavelet transform is proposed. The proposed scheme enables the use of signature images that can be as large as 25% of the host image data, and hence can be used both for digital watermarking and for image/data hiding. In digital watermarking, the primary concern is recovering or checking for the signature even when the embedded image has been altered by image processing operations; thus the embedding scheme should be robust to typical operations such as low-pass filtering and lossy compression. In contrast, for data hiding applications it is important that there be no visible changes to the host data used to transmit the hidden image. In addition, in both data hiding and watermarking, it is desirable that unauthorized persons find it difficult or impossible to recover the embedded signatures. The proposed scheme provides a simple control parameter that can be tailored to either hiding or watermarking purposes, and it is robust to operations such as JPEG compression. Experimental results demonstrate that high-quality recovery of the signature data is possible.
Currently, quite a few image retrieval systems use color and texture features to search images. However, by using global features, these methods often retrieve results that do not make much perceptual sense. It is necessary to constrain feature extraction to homogeneous regions, so that the relevant information within these regions can be well represented. This paper describes our recent work on developing an image segmentation algorithm that is useful for processing large and diverse collections of image data. A compact color feature representation that is more appropriate for these segmented regions is also proposed. By using the color and texture features and a region-based search, we achieve very good retrieval performance compared to search based on the entire image.
We propose a new method for indexing large image databases. The method incorporates neural network learning algorithms and pattern recognition techniques to construct an image pattern dictionary. Image retrieval is then formulated as a process of dictionary search to compute the best matching codeword, which in turn indexes into the database items. Experimental results are presented.
This paper proposes a wavelet transform based multiresolution approach as a viable solution to the problems of storage, retrieval, and browsing in a large image database. We also investigate the performance of an optimal uniform mean-square quantizer for representing all transform coefficients, to ensure that the disk space needed to store a multiresolution representation does not exceed that of the original image. In addition, popular wavelet filters are compared with respect to their reconstruction performance and computational complexity. We conclude that, for our application, the Haar wavelet filters offer an appropriate compromise between reconstruction performance and computational effort.
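Under these conclusions, the storage pipeline can be sketched as below, assuming the PyWavelets package: a Haar decomposition followed by a uniform quantizer over all coefficients, with an illustrative step size standing in for the optimally designed mean-square quantizer.

```python
import numpy as np
import pywt

def multires_store(image, levels=3, step=8):
    """Haar multiresolution decomposition + uniform quantization."""
    coeffs = pywt.wavedec2(image.astype(np.float64), 'haar', level=levels)
    arr, slices = pywt.coeffs_to_array(coeffs)
    return np.round(arr / step).astype(np.int16), slices

def multires_load(q, slices, step=8):
    """Dequantize and reconstruct the image from the stored pyramid."""
    coeffs = pywt.array_to_coeffs(q.astype(np.float64) * step, slices,
                                  output_format='wavedec2')
    return pywt.waverec2(coeffs, 'haar')
```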
In this paper we combine image feature extraction with indexing techniques for efficient retrieval in large texture image databases. A 2-D image signal is processed using a set of Gabor filters to derive a 120-component feature vector representing the image. The feature components are ordered by their relative importance in characterizing a given texture pattern, which facilitates the development of efficient indexing mechanisms. We propose three different sets of indexing features based on the best feature, the average feature, and a combination of both. We investigate the tradeoff between accuracy and discriminating power of these different indexing approaches, and conclude that the combination of the best feature and the average feature gives the best results.
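As an illustration of the combined indexing variant, a hypothetical two-component key per image: the first (most important) component of the importance-ordered 120-D Gabor vector together with the mean over all components; the importance ordering itself is assumed to have been done upstream.

```python
import numpy as np

def index_keys(features):
    """Build (best, average) index keys from importance-ordered feature rows."""
    best = features[:, 0]            # components assumed importance-ordered
    avg = features.mean(axis=1)      # coarse summary of the whole vector
    return np.stack([best, avg], axis=1)
```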