Abstract
Class imbalance, and in particular the classification of rare cases, is an important problem in machine learning. These rare cases are typically the ones of interest, so they must be classified accurately. Class imbalance has been studied extensively on relatively small datasets, but limited research addresses both rarity and class imbalance in Big Data. In this study, we examine the impact of class rarity on fraud detection using publicly available, real-world Big Data from Medicare data sources. We demonstrate that rarity significantly degrades fraud detection performance across three machine learning models and nine datasets with varying numbers of positive class instances. These experiments reveal clear groupings that correspond to different levels of class imbalance and rarity. Furthermore, our finding that performance decreases as rarity increases is corroborated using three additional Medicare Big Data sources.