An Empirical Study on Class Rarity in Big Data

Richard A. Bauder; Taghi M. Khoshgoftaar; Tawfiq Hasanin

doi:10.1109/ICMLA.2018.00125

Conference proceeding

2018 17TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), pp.785-790

01/01/2018

DOI: https://doi.org/10.1109/ICMLA.2018.00125

Subjects: Computer Science; Computer Science, Artificial Intelligence; Computer Science, Theory & Methods; Engineering; Engineering, Electrical & Electronic; Science & Technology; Technology

Abstract

Class imbalance, and in particular the classification of rare cases, is an important problem in machine learning. The rare cases are typically the ones of interest, so they must be classified accurately. Class imbalance has been well studied on relatively small datasets, but little research addresses both rarity and class imbalance in Big Data. In this study, we examine how class rarity affects classification in fraud detection, using publicly available real-world Big Data from Medicare data sources. We demonstrate that rarity significantly degrades fraud detection performance across three machine learning models and nine datasets with varying numbers of positive class instances. These experiments reveal clear groupings that indicate different levels of class imbalance and rarity. Furthermore, our finding that performance decreases as rarity increases is corroborated using three additional Medicare Big Data sources.
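The abstract refers to datasets with varying numbers of positive class instances. This record does not say how those datasets were built, so the following is only a minimal sketch of one possible approach: keep every negative instance and randomly keep a fixed number of positives. The function names `subsample_positives` and `positive_rate`, the 0/1 label encoding, and all counts in the toy data are assumptions made for illustration, not details from the paper.

```python
import random
from collections import Counter


def subsample_positives(labels, n_positive, seed=0):
    """Return sorted row indices keeping every negative and n_positive positives.

    Assumes labels are 1 for the positive (rare) class and 0 otherwise.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if n_positive > len(positives):
        raise ValueError("not enough positive instances to sample from")
    rng = random.Random(seed)  # local generator so results are repeatable
    kept = rng.sample(positives, n_positive)
    return sorted(negatives + kept)


def positive_rate(labels, indices):
    """Fraction of the selected rows that belong to the positive class."""
    counts = Counter(labels[i] for i in indices)
    return counts[1] / len(indices)


# Toy data: 10,000 rows with 100 positives (hypothetical numbers).
labels = [1] * 100 + [0] * 9900

# Hypothetical rarity levels; the paper's actual counts are not given here.
for n_pos in (100, 50, 10):
    idx = subsample_positives(labels, n_pos, seed=42)
    print(n_pos, len(idx), f"{positive_rate(labels, idx):.4%}")
```

Each subsampled index set could then be used to train and evaluate the same models, so that any change in performance can be attributed to the number of positive instances rather than to a change in the negative class.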

Details

Title: An Empirical Study on Class Rarity in Big Data
Creators - without role: Richard A. Bauder - Florida Atlantic University
Taghi M. Khoshgoftaar - Florida Atlantic University
Tawfiq Hasanin - Florida Atlantic University
Contributors - without role: M A Wani
M Kantardzic
M Sayedmouchaweh
J Gama
E Lughofer
Publication Details: 2018 17TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA), pp.785-790
Publisher: IEEE
Number of pages: 6
Grant note: CNS-1427536 / NSF; National Science Foundation (NSF)
Identifiers: 9936002008331
Academic Unit: King Abdulaziz University; Prince Sultan University
Language: English
Resource Type: Conference proceeding