ABI3 : Balancing Act ~ Understanding Imbalanced Datasets

Friday. December 24, 2021 - 1 min

Imbalance datasets are common in many applications such as medical and x-ray images where a moderate imbalance happens with a less than normal proportion having an illness and majority are not or in anomaly detection in manufacturing where perhaps the failure rate is 1 out of 10,000 batches fail. In such cases, the machine learning algorithm will probably over-classify the larger set due to the increased prior probability. Class imbalances can be identified by examining the target calss distribution via a histogram. If the class is imbalanced, then a problem exists.

There are many ways of handling imbalanced datasets and the key solutions to achieve this include :

Modifying class weights
Penalizing the models
Oversampling the minority class : This can be done through the SMOTE technique which creates synthetic observations through proven algorithms - Synthetic Minority Oversampling Technique
Undersampling the majority class