Abstract
• Issuing undeserved sick leaves is a serious ethical problem in hospitals and a real burden on countries' economies.
• Naïve Bayes (NB), Logistic Regression (LR) and K-Nearest Neighbor (K-NN) are used to detect undeserved short sick leaves from hospital data.
• Random under-sampling is applied to improve the performance of the three classifiers under data imbalance.
• Logistic Regression outperformed the other classifiers before under-sampling, whereas Naïve Bayes performed better after under-sampling.
• The recommended under-sampling ratio is 34:66, i.e., 34 % of the records are deserved sick leaves and 66 % are undeserved sick leaves.
Artificial intelligence and machine learning now play an important role in improving medical services. One service that would benefit from such techniques is the issuing of sick leaves, because the service is observably abused: employees and students can obtain undeserved sick leaves from hospitals, either by feigning illness or by exploiting connections with medical staff or physicians. This paper investigates the detection of undeserved short sick leaves under data imbalance. A highly skewed real dataset is used, in which 93 % of the records are deserved sick leaves and only 7 % are undeserved. Three classifiers, namely Naïve Bayes (NB), Logistic Regression (LR) and K-Nearest Neighbor (K-NN), are built, tested, and compared. Random under-sampling is applied to remedy the data imbalance: four samples of the dataset with different class ratios (deserved vs. undeserved) are created. Each classifier is evaluated on each sample using a set of measures that includes accuracy, specificity, and Area Under the Curve (AUC). On the original data, LR performs best (accuracy = 97 %, specificity = 76 % and AUC = 87 %), followed by NB and then K-NN. On the sampled data, however, NB outperforms both LR and K-NN, with accuracy up to 90 %, specificity up to 94 % and AUC up to 88 %. The results also indicate that the best sampling ratio is 34 % deserved to 66 % undeserved sick leaves.
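
The random under-sampling step described above can be illustrated with a minimal Python sketch. It assumes that all minority-class (undeserved) records are kept and that majority-class (deserved) records are dropped at random until they reach a target share of the sample. The function name `random_under_sample`, its parameters, and the synthetic 93:7 data are hypothetical and serve only as an illustration; they are not taken from the paper's implementation.

```python
import random
from collections import Counter


def random_under_sample(records, labels, majority_label, majority_share, seed=0):
    """Randomly drop majority-class records until that class makes up
    roughly `majority_share` of the returned sample (hypothetical helper)."""
    if not 0 < majority_share < 1:
        raise ValueError("majority_share must be strictly between 0 and 1")
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]
    # Keep every minority record and solve
    #   n_maj / (n_maj + n_min) = majority_share
    # for the number of majority records to retain.
    n_keep = round(len(minority_idx) * majority_share / (1 - majority_share))
    n_keep = min(n_keep, len(majority_idx))  # cannot keep more than exist
    kept = sorted(rng.sample(majority_idx, n_keep) + minority_idx)
    return [records[i] for i in kept], [labels[i] for i in kept]


# Synthetic stand-in for the 93:7 imbalance (not the paper's data).
labels = ["deserved"] * 930 + ["undeserved"] * 70
records = list(range(len(labels)))

# Target the 34:66 (deserved:undeserved) ratio reported as best.
_, y_res = random_under_sample(records, labels, "deserved", 0.34)
counts = Counter(y_res)
print(dict(counts))
```

With these synthetic counts, the sample retains all 70 undeserved records and 36 randomly chosen deserved records, which gives a deserved share of about 34 %. Fixing the seed makes the sample reproducible, which matters when classifiers are compared across several sampling ratios.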