Abstract
This paper addresses the digitization of, and feature extraction from, large volumes of documents stored as images, using big data and cloud technologies. It presents layout analysis, image representation, feature extraction, and the transformation of a large number of prepared document images.
Accordingly, the focus is on efficient and highly reliable clustering of these document images. The main methodology applied in this paper is an image-based extractor built on "document image similarity". Several tasks are proposed to realize this idea and to support retrieval performance. The methods presented include layout analysis, image representation, feature vector creation, similarity measurement, accuracy computation, and document image retrieval.