Segmentation and detection of text in document images

Darko Zelenika, Janez Povh, Bernard Ženko

2015

Abstract

Text detection in document images plays an important role in optical character recognition systems and is a challenging task. The proposed text detection method uses a self-adjusting bottom-up segmentation algorithm to segment a document image into a set of connected components. The segmented connected components are then described in terms of 27 features, and a machine learning algorithm is used to classify each component as text or non-text. To test the method, we have collected a dataset (called ASTRoID) that contains 500 images of text blocks and 500 images of non-text blocks. We empirically compare the performance of the proposed text detection method when combined with each of seven different machine learning algorithms; the best performance is obtained with the radial support vector machine.
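
The first two stages of the pipeline described above (segmentation into connected components, then feature description) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the self-adjusting segmentation or the 27 features, so the sketch assumes plain 4-connected flood-fill labelling on an already binarised image and four hypothetical features (`width`, `height`, `aspect_ratio`, `fill_ratio`). The function names `connected_components` and `describe` are likewise illustrative.

```python
from collections import deque


def connected_components(binary):
    """Group 4-connected foreground pixels (value 1) into components.

    Assumption: a plain flood fill stands in for the paper's
    self-adjusting bottom-up segmentation, which is not specified here.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def describe(pixels):
    """Compute a few hypothetical shape features for one component."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        # Share of the bounding box covered by foreground pixels.
        "fill_ratio": len(pixels) / (width * height),
    }


# Toy binarised image: a 2x2 blob and a 1x3 vertical stroke.
image = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
]
features = [describe(p) for p in connected_components(image)]
for feature_vector in features:
    print(feature_vector)
```

In the method described above, feature vectors of this kind would then be passed to a trained classifier that labels each component as text or non-text; that step is omitted here because it requires labelled training data.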

Type

Conference paper

Publication

International Conference on Computer Recognition Systems