Paper
3 April 1997 Duplicate document detection
Larry Spitz
Proceedings Volume 3027, Document Recognition IV; (1997) https://doi.org/10.1117/12.270062
Event: Electronic Imaging '97, 1997, San Jose, CA, United States
Abstract
In document image filing applications it is important to be able to recognize whether a particular document has already been entered into the system, either as an individual document or as an inclusion in another document. Document images could be matched on the basis of layout or contents. However, matching of layout may not be effective when style is strictly controlled. We develop a document 'handle' which is stored along with the document image. The handle is simply a character shape coded representation of the image after the figures and tables have been removed. Character shape coding is a method of identifying individual character images as members of one of a small number of classes. This process is computationally inexpensive and tolerant of differing generations of photocopying, skew and scanner characteristics. When a new document is entered into the system, its handle is computed and compared against all of the extant handles using a normalized Levenshtein metric. We demonstrate the ability to detect duplicate documents comprising single and multiple pages.
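The handle comparison described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's implementation: `shape_code` is a toy stand-in that works on plain text rather than on character images, its class alphabet (`A`, `g`, `i`, `x`, `.`) is hypothetical, the edit distance is normalized by the length of the longer handle, and the `threshold` value of 0.2 is arbitrary.

```python
def shape_code(text: str) -> str:
    """Toy stand-in for character shape coding (hypothetical classes).

    The real method classifies character *images*; here we only mimic the
    idea of collapsing many characters into a few coarse shape classes.
    """
    ascenders = set("bdfhklt")
    descenders = set("gpqy")
    out = []
    for ch in text:
        if ch.isspace():
            out.append(" ")
        elif ch.isupper() or ch.isdigit() or ch in ascenders:
            out.append("A")   # tall glyphs
        elif ch in descenders:
            out.append("g")   # glyphs that drop below the baseline
        elif ch in "ij":
            out.append("i")   # dotted glyphs
        elif ch.isalpha():
            out.append("x")   # x-height glyphs
        else:
            out.append(".")   # punctuation and everything else
    return "".join(out)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def find_duplicates(new_handle: str, stored: dict, threshold: float = 0.2):
    """Compare a new handle against every stored handle; closest first."""
    scored = [(doc_id, normalized_distance(new_handle, handle))
              for doc_id, handle in stored.items()]
    return sorted((m for m in scored if m[1] <= threshold), key=lambda m: m[1])


# Usage: the second "scan" differs by one character (o read as 0).
stored = {
    "doc-a": shape_code("The quick brown fox"),
    "doc-b": shape_code("Completely different material here"),
}
matches = find_duplicates(shape_code("The quick brown f0x"), stored)
```

Because the shape classes are coarse, many recognition errors leave the handle unchanged or alter only a character or two, which is one way to read the abstract's claim of tolerance to photocopy generations and scanner differences. The exhaustive scan over all stored handles follows the abstract's description of comparing against all extant handles.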
© (1997) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Larry Spitz "Duplicate document detection", Proc. SPIE 3027, Document Recognition IV, (3 April 1997); https://doi.org/10.1117/12.270062
CITATIONS
Cited by 24 scholarly publications.
KEYWORDS
Databases
Image compression
Image processing
Computing systems
Data storage
Detection and tracking algorithms
Image quality