Abstract
With the recent explosion of big data, real-world datasets increasingly exhibit severe class imbalance, which is likely to hinder the performance of Machine Learning algorithms. Our contribution is to show that good classification performance on big data, across different application domains, can be achieved without heavily altering the original dataset. To demonstrate this, we process four datasets from different domains and generate several imbalanced variants of each. From each original full dataset we create five new imbalanced big datasets, with target positive class proportions of 10%, 1%, 0.1%, 0.01%, and 0.001%, to study the information loss introduced by traditional random undersampling. Random undersampling is then applied to each of these imbalanced datasets to balance the binary classes at a 50:50 ratio. All models were built with the Random Forest classifier, using the Spark and H2O machine learning libraries, and performance was recorded to identify good ratios for undersampling big data without discarding too much of the majority class. We compare all models built from the prepared datasets, 4,488 models in total. We conclude that imbalanced data with a minority class of 0.1% to 1.0% yields adequate performance, even when compared to 10% or even 100% of the full balanced dataset. Moreover, when the negative class is randomly undersampled to 50:50, average performance is similar to that obtained using the entire big dataset.
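The two data-preparation steps described above can be illustrated with a minimal sketch. It is not the authors' Spark or H2O pipeline. It assumes binary labels encoded as 1 (positive, minority) and 0 (negative, majority), and it assumes that an imbalanced variant is produced by discarding positive instances while keeping every negative one, which the abstract does not specify. The function names `make_imbalanced` and `undersample_to_balance` are hypothetical.

```python
import random


def make_imbalanced(labels, positive_fraction, seed=0):
    """Return sorted row indices of a variant whose positive share is
    approximately `positive_fraction`.

    Assumption (not stated in the source): the variant is built by
    randomly discarding positives and keeping all negatives.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Solve p / (p + n) = f for p, with n fixed at the number of negatives.
    target_pos = round(positive_fraction * len(neg) / (1.0 - positive_fraction))
    # Keep at least one positive and never more than are available.
    target_pos = max(1, min(len(pos), target_pos))
    return sorted(rng.sample(pos, target_pos) + neg)


def undersample_to_balance(labels, indices, seed=0):
    """Random undersampling: keep every positive row in `indices` and an
    equally sized random subset of the negative rows (50:50 ratio).

    Assumes positives are the minority within `indices`.
    """
    rng = random.Random(seed)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    kept_neg = rng.sample(neg, min(len(pos), len(neg)))
    return sorted(pos + kept_neg)
```

For example, calling `make_imbalanced(labels, 0.01)` and then passing the result to `undersample_to_balance` yields a 1% variant followed by its 50:50 undersampled counterpart. The number of negatives discarded in the second step grows as the positive share shrinks, which is the information loss the study measures.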