Paper
28 January 2004 Natural language insensitive short textual string compression
Author Affiliations +
Abstract
There are applications (such as Internet search engines) where short textual strings, for example abstracts or pieces of Web pages, need to be compressed independently of each other. The usual adaptive compression algorithms perform poorly on these short strings due to the lack of necessary data to learn. In this manuscript, we introduce a compression algorithm targeting short text strings; e.g., containing a few hundred symbols. We also target natural language insensitivity, to facilitate its robust compression and fast implementation. The algorithm is based on the following findings. Applying the move-to-front transform (MTFT) after the Burrows-Wheeler transform (BWT) brings the short textual strings to a "normalized form" where the distribution of the resulting "ranks" has a shape similar over the set of natural language strings. This facilitates the use of a static coding method with few variations, which we call shortBWT, where no on-line learning is needed, to encode the ranks. Finally, for short strings, shortBWT runs very fast because the strings fit into the cache of most current computers. The introduction for this paper will review the mathematical bases of BWT and MTF, it will also review our recently published metric for rapidly pre-characterizing the compressibility of such short textual strings when using these transforms.
© (2004) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Cornel Constantinescu, Jennifer Q. Trelewicz, and Ronald B. Arps "Natural language insensitive short textual string compression", Proc. SPIE 5208, Mathematics of Data/Image Coding, Compression, and Encryption VI, with Applications, (28 January 2004); https://doi.org/10.1117/12.507758
Lens.org Logo
CITATIONS
Cited by 1 scholarly publication.
Advertisement
Advertisement
RIGHTS & PERMISSIONS
Get copyright permission  Get copyright permission on Copyright Marketplace
KEYWORDS
Genetic algorithms

Detection and tracking algorithms

Electroluminescence

Algorithm development

Computer programming

Internet

Algorithms

Back to Top