Abstract
In this work, we develop and experimentally validate a novel model for external clustering validation on very large datasets using the conditional entropy index. The model, termed MR-Centropy, performs clustering validation in a parallel and distributed manner using the MapReduce framework. The aim is to scale with increasing dataset size when a ground-truth clustering is available. MR-Centropy is a three-job process in which each job consists of a Map function and a Reduce function; three jobs are needed to gather all the statistics involved in computing the conditional entropy index. Each step of the proposed framework runs in parallel. Numerical tests on real and synthetic datasets demonstrate the effectiveness of the proposed model.
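To make the quantity concrete, the following is a minimal single-machine sketch of how the conditional entropy index can be assembled from counts in a map-and-reduce style. It assumes the standard definition H(T|C) = -Σ_i Σ_j (n_ij / n) log(n_ij / n_i), where n_ij is the number of points in cluster i with ground-truth class j, n_i is the size of cluster i, and n is the total number of points. The split into three steps, the function names, and the base-2 logarithm are illustrative assumptions and are not taken from the MR-Centropy implementation; no actual MapReduce runtime is used.

```python
from collections import Counter
from math import log2

def count_pairs(labelled_points):
    # Step 1 (assumed): the "map" emits ((cluster, truth), 1) for each point
    # and the "reduce" sums per key, giving the contingency counts n_ij.
    return Counter(labelled_points)

def count_clusters(pair_counts):
    # Step 2 (assumed): sum n_ij over ground-truth classes to get cluster sizes n_i.
    sizes = Counter()
    for (cluster, _truth), n_ij in pair_counts.items():
        sizes[cluster] += n_ij
    return sizes

def conditional_entropy(pair_counts, cluster_sizes):
    # Step 3 (assumed): combine the counts into H(T|C), using log base 2.
    n = sum(cluster_sizes.values())
    return -sum(
        (n_ij / n) * log2(n_ij / cluster_sizes[cluster])
        for (cluster, _truth), n_ij in pair_counts.items()
    )

def centropy(labelled_points):
    # Convenience wrapper: labelled_points is an iterable of (cluster, truth) pairs.
    pairs = count_pairs(labelled_points)
    return conditional_entropy(pairs, count_clusters(pairs))
```

Under this definition, a clustering that matches the ground truth exactly scores 0, and a single cluster containing two equally frequent classes scores 1 bit; lower values indicate purer clusters.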