In this paper we present a technique for the separation of harmonic sounds within real sound mixtures for automatic music transcription using Independent Subspace Analysis (ISA). The algorithm is based on the assumption that the tones played by an instrument in polyphonic music consist of components that are statistically independent of the components of other tones. The first step of the algorithm is a temporal segmentation into note events. Features in both the time domain and the frequency domain are used to detect segment boundaries, which correspond to starting or decaying tones. Each segment is then examined using ISA, and a set of statistically independent components is calculated. A tone played by an instrument consists of its fundamental frequency and its harmonics. Usually, ISA yields more independent components than notes played, because not all harmonics are assigned to the component containing their fundamental frequency; some harmonics are separated into components of their own. Components that belong together are therefore grouped using the Kullback-Leibler divergence. A note classification stage, currently trained on piano music, is the last step of the algorithm. Results show that statistical independence is a promising criterion for separating sounds into single notes using ISA as a step towards automatic music transcription.
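To make the grouping step concrete, the following Python/numpy sketch merges component spectra whose symmetric Kullback-Leibler divergence falls below a threshold; the divergence form, the greedy merging and the threshold value are illustrative assumptions rather than the exact procedure of the paper:

# Hypothetical sketch: grouping ISA components by symmetric Kullback-Leibler
# divergence between their normalized frequency envelopes. Threshold and
# clustering strategy are illustrative assumptions.
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two non-negative spectra."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))) +
                 np.sum(q * np.log((q + eps) / (p + eps))))

def group_components(spectra, threshold=0.5):
    """Greedily merge components whose pairwise divergence is below threshold.

    spectra: array of shape (n_components, n_bins), non-negative.
    Returns a list of index groups, each group assumed to form one note.
    """
    n = len(spectra)
    labels = list(range(n))                    # each component starts alone
    for i in range(n):
        for j in range(i + 1, n):
            if symmetric_kl(spectra[i], spectra[j]) < threshold:
                old, new = labels[j], labels[i]
                labels = [new if l == old else l for l in labels]
    groups = {}
    for idx, l in enumerate(labels):
        groups.setdefault(l, []).append(idx)
    return list(groups.values())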
Nowadays, video coding standards for object-based video coding and tools for multimedia content description are available, providing powerful means for content-based video coding, description, indexing and organization. In the past, it was difficult to extract higher-level semantics, such as video objects, automatically. In this paper, we present a novel approach to moving object region detection. For this purpose, we developed a framework that applies bidirectional global motion estimation and compensation in order to identify potential foreground object regions. After spatial image segmentation, the results are assigned to image segments and further diffused over the image region. This enables robust object region detection even in cases where the investigated object does not move all the time. Finally, each image segment is classified as belonging either to the foreground or to the background. Subsequent region merging delivers foreground object masks which can be used to define the region of attention for content-based video coding, as well as for contour-based object classification.
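As a rough illustration of the per-segment classification step, the sketch below assumes that globally motion-compensated neighbour frames and a spatial segment map are already available, and labels a segment as foreground when enough of its pixels show large residual differences; the thresholds and the voting rule are hypothetical choices, not the paper's exact diffusion scheme:

# Illustrative sketch only: after global motion compensation, large residual
# differences hint at foreground motion; the evidence is accumulated per
# spatial segment and each segment is classified by a simple vote.
import numpy as np

def classify_segments(curr, prev_comp, next_comp, segment_map,
                      diff_thresh=20.0, vote_thresh=0.3):
    """Label each segment as foreground (True) or background (False).

    curr:        current frame, float array (H, W)
    prev_comp:   previous frame warped towards curr by global motion
    next_comp:   next frame warped towards curr (bidirectional check)
    segment_map: integer array (H, W) with one label per spatial segment
    """
    # Pixels that differ strongly from BOTH compensated neighbours are
    # likely covered by an independently moving (foreground) object.
    residual = np.minimum(np.abs(curr - prev_comp), np.abs(curr - next_comp))
    motion_mask = residual > diff_thresh

    fg = {}
    for lab in np.unique(segment_map):
        region = segment_map == lab
        # fraction of "moving" pixels inside the segment
        fg[lab] = motion_mask[region].mean() > vote_thresh
    return fg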
With the multimedia content description interface MPEG-7, powerful tools for video indexing are available, based on which content-based search and retrieval with respect to separate shots and scenes in video can be performed. We focus in particular on the parametric motion descriptor. The motion parameters, which are finally coded in the descriptor values, require robust content extraction methods. In this paper, we introduce our approach to the extraction of global motion from video. For this purpose, we apply a constrained feature point selection and matching approach to find correspondences between images. Subsequently, an M-estimator is used for robust estimation of the motion model parameters. We evaluate the performance of our approach using affine and biquadratic motion models, also in comparison with a standard least-median-of-squares based approach to global motion estimation.
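The following minimal sketch shows one common way to realize such an M-estimator, namely iteratively reweighted least squares with Huber weights applied to an affine motion model; the estimator, weight function and parameters actually used in the paper may differ:

# Minimal sketch of robust affine global-motion estimation from point
# correspondences via iteratively reweighted least squares (Huber weights).
import numpy as np

def estimate_affine_robust(src, dst, iters=10, huber_k=1.345):
    """src, dst: arrays of shape (N, 2) with matched feature points."""
    N = src.shape[0]
    # Design matrix for x' = a0 + a1*x + a2*y,  y' = a3 + a4*x + a5*y
    A = np.zeros((2 * N, 6))
    A[0::2, 0] = 1; A[0::2, 1] = src[:, 0]; A[0::2, 2] = src[:, 1]
    A[1::2, 3] = 1; A[1::2, 4] = src[:, 0]; A[1::2, 5] = src[:, 1]
    b = dst.reshape(-1)

    w = np.ones(2 * N)
    params = None
    for _ in range(iters):
        sw = np.sqrt(w)
        params, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
        r = b - A @ params
        scale = np.median(np.abs(r)) / 0.6745 + 1e-9   # robust scale estimate
        u = np.abs(r) / scale
        w = np.where(u <= huber_k, 1.0, huber_k / u)   # Huber weights
    return params  # [a0, a1, a2, a3, a4, a5]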
In this paper we present an audio thumbnailing technique based on audio segmentation by similarity search. The segmentation is performed on MPEG-7 low-level audio feature descriptors, a growing source of multimedia metadata. Especially for database applications or audio-on-demand services this technique can be very helpful, because there is no need to access the original audio material, which may be copyright-protected. The result of the similarity search is a matrix containing off-diagonal stripes that represent similar regions; these are usually the refrains of a song and thus very suitable segments to be used as audio thumbnails. Using the a priori knowledge that the off-diagonal stripes we are looking for must represent several seconds of audio data and must have a characteristic orientation, we implemented a filter to enhance the structure of the similarity matrix and to extract a relevant segment as an audio thumbnail.
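The sketch below illustrates the underlying idea, assuming per-frame feature vectors are already extracted: a cosine self-similarity matrix is computed and the strongest off-diagonal stripe of a minimum length is located; the stripe length, lag range and scoring are illustrative assumptions rather than the paper's filter design:

# Sketch of the similarity-matrix idea, assuming per-frame feature vectors
# (e.g. low-level audio descriptors) are already available.
import numpy as np

def similarity_matrix(features):
    """features: (n_frames, dim). Cosine similarity between all frame pairs."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    return f @ f.T

def find_thumbnail(sim, min_len=50, lag_min=10):
    """Look for the strongest off-diagonal stripe of at least min_len frames.

    A stripe at lag L means frames i..i+min_len resemble frames i+L onward,
    which typically corresponds to a repeated section such as the refrain.
    Returns (start_frame, lag) of the best stripe found.
    """
    n = sim.shape[0]
    best = (0, lag_min, -np.inf)
    kernel = np.ones(min_len) / min_len
    for lag in range(lag_min, n - min_len):
        diag = np.diagonal(sim, offset=lag)
        smoothed = np.convolve(diag, kernel, mode='valid')   # stripe enhancement
        i = int(np.argmax(smoothed))
        if smoothed[i] > best[2]:
            best = (i, lag, float(smoothed[i]))
    start, lag, _ = best
    return start, lag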
The huge amount of multimedia data produced worldwide requires annotation in order to enable universal content access and to provide content-based search-and-retrieval functionalities. Since manual video annotation can be time consuming, automatic annotation systems are required. We review recent approaches to content-based indexing and annotation of videos for different kinds of sports and describe our approach to the automatic annotation of equestrian sports videos. We especially concentrate on MPEG-7 based feature extraction and content description, where we apply different visual descriptors for cut detection. Further, we extract the temporal positions of single obstacles on the course by analyzing MPEG-7 edge information. Having determined the single shot positions as well as the visual highlights, the information is stored jointly with meta-textual information in an MPEG-7 description scheme. Based on this information, we generate content summaries which can be utilized in a user interface to provide content-based access to the video stream, as well as for media browsing on a streaming server.
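As an illustration of descriptor-based cut detection, the following sketch thresholds the normalized L1 distance between consecutive per-frame feature vectors, used here as stand-ins for the MPEG-7 visual descriptors mentioned above; the distance measure and threshold are assumptions:

# Hedged sketch: cut detection by thresholding frame-to-frame descriptor
# distances. The feature vectors, metric and threshold are assumed inputs.
import numpy as np

def detect_cuts(descriptors, threshold=0.35):
    """descriptors: (n_frames, dim) array of per-frame feature vectors.

    Returns frame indices where the descriptor distance to the previous
    frame exceeds the threshold, i.e. candidate shot boundaries.
    """
    d = np.abs(np.diff(descriptors, axis=0)).sum(axis=1)   # L1 distance
    d = d / descriptors.shape[1]                            # normalize by dim
    return [i + 1 for i, v in enumerate(d) if v > threshold]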
The amount of multimedia data available worldwide is increasing every day. There is a vital need to annotate multimedia data in order to allow universal content access and to provide content-based search-and-retrieval functionalities. Since supervised video annotation can be time consuming, an automatic solution is desirable. We review recent approaches to content-based indexing and annotation of videos for different kinds of sports, and present our application for the automatic annotation of equestrian sports videos. We especially concentrate on MPEG-7 based feature extraction and content description, applying different visual descriptors for cut detection. Further, we extract the temporal positions of single obstacles on the course by analyzing MPEG-7 edge information and taking specific domain knowledge into account. Having determined the single shot positions as well as the visual highlights, the information is stored jointly with additional textual information in an MPEG-7 description scheme. Using this information, we generate content summaries which can be utilized in a user front-end to provide content-based access to the video stream, as well as content-based queries and navigation on a video-on-demand streaming server.
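To illustrate how edge information can hint at obstacle positions, the sketch below flags frames whose smoothed edge-histogram energy stands out from the sequence average; the feature, smoothing window and threshold are hypothetical and only approximate the idea of combining MPEG-7 edge information with domain knowledge:

# Illustrative sketch only: temporal obstacle candidates from per-frame
# edge-histogram energy. Window length and threshold are assumptions.
import numpy as np

def obstacle_candidates(edge_histograms, window=25, k=1.5):
    """edge_histograms: (n_frames, n_bins) edge-direction histograms per frame.

    Returns frame indices where the smoothed overall edge energy exceeds
    mean + k * std, taken as temporal positions of obstacle candidates.
    """
    energy = edge_histograms.sum(axis=1)
    kernel = np.ones(window) / window
    smooth = np.convolve(energy, kernel, mode='same')       # suppress noise
    thresh = smooth.mean() + k * smooth.std()
    return [i for i, v in enumerate(smooth) if v > thresh]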
Object shape features are powerful when used in similarity search-and-retrieval and object recognition, because object shape is usually strongly linked to object functionality and identity. Many applications, including those concerned with visual object retrieval or indexing, are likely to use shape features. Such systems have to cope with scaling, rotation, deformation and partial occlusion of the objects to be described. The ISO standard MPEG-7 contains different shape descriptors, of which we focus especially on the region-shape descriptor. Since we found that the region-shape descriptor is not very robust against partial occlusion, we propose a slightly modified feature extraction method based on central moments. Further, we compare our method with the original region-shape implementation and show that, with the proposed changes, the robustness of the region-shape descriptor against partial occlusions can be significantly increased.
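A minimal sketch of a central-moments based region shape feature is given below; it normalizes for translation and scale, but does not reproduce the exact moment orders or normalization of the proposed method:

# Sketch: normalized central moments of a binary region mask as a simple
# shape feature. Moment orders and normalization are illustrative choices.
import numpy as np

def central_moments_feature(mask, max_order=3):
    """mask: binary (H, W) region mask. Returns normalized central moments.

    Translation invariance comes from centering on the centroid, scale
    invariance from normalizing with the zeroth-order moment.
    """
    ys, xs = np.nonzero(mask)
    m00 = len(xs)
    if m00 == 0:
        return np.zeros((max_order + 1) ** 2)
    cx, cy = xs.mean(), ys.mean()
    feats = []
    for p in range(max_order + 1):
        for q in range(max_order + 1):
            mu_pq = np.sum((xs - cx) ** p * (ys - cy) ** q)
            # scale normalization (eta moments)
            eta = mu_pq / (m00 ** (1 + (p + q) / 2.0))
            feats.append(eta)
    return np.array(feats)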
Due to the rapidly growing amount of multimedia content available on the internet, it is highly desirable to index multimedia data automatically and to provide content-based search and retrieval functionalities. The first step in describing and annotating video data is to split the sequences into sub-shots that correspond to semantic units. This paper addresses unsupervised scene change detection and keyframe selection in video sequences. Unlike other methods, this is performed using a standardized multimedia content description of the video data. We apply the MPEG-7 scalable color descriptor and the edge histogram descriptor for shot boundary detection and show that this method performs well. Furthermore, we propose to store the output data of our system in a video segment description scheme to provide simple but efficient search and retrieval functionalities for video scenes based on color features.
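As a sketch of the keyframe selection step, the code below picks, within each detected shot, the frame whose descriptor is closest to the shot's mean feature vector; this is one plausible realization under assumed inputs, not necessarily the selection rule used in the paper:

# Sketch: representative keyframe per shot from per-frame descriptors and
# previously detected shot boundaries. Inputs and criterion are assumptions.
import numpy as np

def select_keyframes(descriptors, cut_indices):
    """descriptors: (n_frames, dim); cut_indices: sorted shot boundary frames."""
    bounds = [0] + list(cut_indices) + [descriptors.shape[0]]
    keyframes = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if end <= start:
            continue
        shot = descriptors[start:end]
        dist = np.linalg.norm(shot - shot.mean(axis=0), axis=1)
        keyframes.append(start + int(np.argmin(dist)))      # most representative
    return keyframes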