Abstract
Efficient full-text search is typically achieved by indexing the raw data, at an additional storage cost of 20 to 30 percent. In the context of Big Data, this additional storage is very large and makes it challenging to serve full-text search queries with good performance. It also incurs overhead to store, manage, and update the large index. In this paper, we propose and evaluate a method that minimizes the index size for full-text search over Big Data using automatic extractive text summarization. To evaluate the effectiveness of the proposed approach, we used two real-world datasets. We indexed both the original and the summarized datasets using Apache Lucene and, for a range of search queries, compared the search results using three measures: average simple overlap, Spearman's rho correlation, and average ranking score. Our experimental evaluation shows that automatic text summarization is an effective method for significantly reducing the index size. With the proposed solution, we obtained a maximum reduction of 82% in index size together with 42% higher relevance of the search results.
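As an illustration of two of the measures named above, the following minimal sketch compares the ranked results returned for one query by an index over the original data and an index over the summarized data. The function names, the top-k cutoff, and the choice to compute Spearman's rho only over documents present in both result lists are assumptions made for this example; the paper's exact definitions may differ.

```python
def simple_overlap(full_results, summary_results, k=10):
    """Fraction of the top-k documents from the full index that also
    appear in the top-k documents from the summary index.
    (Assumed definition, for illustration only.)"""
    top_full = full_results[:k]
    top_summary = set(summary_results[:k])
    if not top_full:
        return 0.0
    shared = sum(1 for doc in top_full if doc in top_summary)
    return shared / len(top_full)


def spearman_rho(full_results, summary_results):
    """Spearman's rho between two rankings, restricted to the documents
    that occur in both lists (assumption: no tied ranks).
    Returns None when fewer than two documents are shared."""
    in_summary = set(summary_results)
    common = [doc for doc in full_results if doc in in_summary]
    n = len(common)
    if n < 2:
        return None
    # Re-rank the shared documents within each list, starting at 0.
    rank_full = {doc: i for i, doc in enumerate(common)}
    shared_in_summary_order = [d for d in summary_results if d in rank_full]
    rank_summary = {doc: i for i, doc in enumerate(shared_in_summary_order)}
    d_squared = sum((rank_full[d] - rank_summary[d]) ** 2 for d in common)
    return 1 - (6 * d_squared) / (n * (n * n - 1))


# Hypothetical document IDs returned for a single query.
full = ["a", "b", "c", "d"]
summary = ["a", "c", "b", "e"]
overlap = simple_overlap(full, summary, k=4)
rho = spearman_rho(full, summary)
```

Averaging these per-query values over a set of queries would give an average overlap and an average correlation for a dataset.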