The Effects of Random Undersampling with Simulated Class Imbalance for Big Data

Tawfiq Hasanin; Taghi M. Khoshgoftaar; IEEE

doi:10.1109/IRI.2018.00018

Conference proceeding

2018 IEEE International Conference on Information Reuse and Integration (IRI), pp. 70-79

01/01/2018

DOI: https://doi.org/10.1109/IRI.2018.00018

Subjects

Computer Science

Computer Science, Information Systems

Computer Science, Theory & Methods

Engineering

Engineering, Electrical & Electronic

Science & Technology

Technology

Abstract

With the recent explosion of big data, real-world data are increasingly affected by larger degrees of class imbalance, which is likely to hinder the performance of machine learning algorithms. The contribution of our work is to show that good classification performance on big data, across different application domains, can be achieved without too much alteration to the original dataset. To demonstrate this, we process four datasets from different domains and generate several imbalanced variants of each. Five new imbalanced big datasets, with target positive-class proportions of 10%, 1%, 0.1%, 0.01%, and 0.001%, were created from each original full dataset to study the information loss introduced by traditional random undersampling. Random undersampling is then applied to balance the binary classes in each of the created imbalanced datasets, generating 50:50 class ratios. All models were built with the Random Forest classifier, using the Spark and H2O machine learning libraries, and performance was recorded to find good ratios for undersampling big data without discarding too much of the majority class. We provide a comparison of all models created from the prepared datasets, 4,488 models in total. We conclude that, for imbalanced data with a minority class of 0.1% to 1.0%, adequate performance is obtained even when compared to 10% or even 100% of the full balanced dataset. Moreover, when randomly undersampling the negative class to 50:50, the average performance is similar to that obtained using the entire big dataset.
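The two data-preparation steps named in the abstract (simulating a target positive-class proportion, then random undersampling to 50:50) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `simulate_imbalance` and `random_undersample` are hypothetical, the toy data sizes are invented, and the sketch assumes that imbalance is simulated by keeping every negative instance and sampling positives without replacement, which the abstract does not specify. Model training with Random Forest in Spark or H2O is omitted.

```python
import random


def simulate_imbalance(positives, negatives, pos_fraction, rng):
    """Return (sampled_positives, negatives) with roughly pos_fraction positives.

    Assumption (not stated in the abstract): all negatives are kept and
    positives are sampled without replacement.
    """
    # Solve n_pos / (n_pos + n_neg) = p for n_pos.
    n_pos = round(pos_fraction * len(negatives) / (1.0 - pos_fraction))
    if not 1 <= n_pos <= len(positives):
        raise ValueError("target proportion is not reachable with this data")
    return rng.sample(positives, n_pos), list(negatives)


def random_undersample(positives, negatives, rng):
    """Randomly discard negatives until both classes have the same size (50:50)."""
    n_keep = min(len(positives), len(negatives))
    return list(positives), rng.sample(negatives, n_keep)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the sketch is repeatable
    # Toy stand-ins for a binary dataset; sizes are illustrative only.
    positives = [("pos", i) for i in range(50_000)]
    negatives = [("neg", i) for i in range(100_000)]

    # The five target positive-class proportions listed in the abstract.
    for frac in (0.10, 0.01, 0.001, 0.0001, 0.00001):
        pos, neg = simulate_imbalance(positives, negatives, frac, rng)
        bal_pos, bal_neg = random_undersample(pos, neg, rng)
        print(f"target {frac:.3%}: {len(pos)} positives vs {len(neg)} negatives; "
              f"after undersampling {len(bal_pos)} vs {len(bal_neg)}")
```

At the most extreme proportions the balanced set retains only a handful of negative instances, which is the kind of information loss the abstract sets out to study.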

Details

Title: The Effects of Random Undersampling with Simulated Class Imbalance for Big Data
Creators: Tawfiq Hasanin - Florida Atlantic University; Taghi M. Khoshgoftaar - Florida Atlantic University; IEEE
Publication Details: 2018 IEEE International Conference on Information Reuse and Integration (IRI), pp. 70-79
Publisher: IEEE
Number of pages: 10
Grant note: CNS-1427536 / NSF; National Science Foundation (NSF)
Identifiers: 9939957208331
Academic Unit: King Abdulaziz University
Language: English
Resource Type: Conference proceeding