Conventional algorithms are often biased towards the majority class because their loss functions optimize quantities such as overall error rate without taking the class distribution into account. The result can be a trivial classifier that assigns every example to the majority class.
The package implements 85 variants of the Synthetic Minority Oversampling Technique (SMOTE). Besides the implementations, an easy-to-use model selection framework is supplied to enable rapid evaluation of oversampling techniques on unseen datasets.
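As a rough sketch (assuming the package referred to is the smote_variants Python package; the sample(X, y) interface below follows its documentation and should be treated as an assumption rather than something verified here), oversampling a dataset might look like this:

    import smote_variants as sv
    from sklearn.datasets import make_classification

    # Toy imbalanced dataset: roughly 90% majority class, 10% minority class.
    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.9, 0.1], random_state=42)

    # Any of the implemented SMOTE variants could be swapped in here
    # (class and method names assumed from the package documentation).
    oversampler = sv.SMOTE()
    X_resampled, y_resampled = oversampler.sample(X, y)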
There are a number of methods available to oversample a dataset used in a typical classification problem (for example, classifying a set of images given a labelled training set). The most common technique is known as SMOTE, which creates synthetic minority examples by interpolating between a minority instance and one of its nearest minority-class neighbors.
ADASYN builds on the methodology of SMOTE by shifting emphasis towards the minority class examples near the classification boundary that are difficult to learn. It uses a weighted distribution over minority class examples according to their level of difficulty, so that more synthetic data is generated for the examples that are harder to learn.
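As a minimal sketch of how these oversamplers are typically applied, the following uses the imbalanced-learn library (my choice of library, not one named in the text) to run SMOTE and ADASYN on a toy imbalanced dataset:

    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE, ADASYN

    # Toy imbalanced dataset: roughly 95% majority class, 5% minority class.
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    print("original:", Counter(y))

    # SMOTE: interpolate between minority examples and their nearest minority neighbors.
    X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
    print("after SMOTE:", Counter(y_sm))

    # ADASYN: generate more synthetic data for minority examples that are harder to learn.
    X_ada, y_ada = ADASYN(random_state=0).fit_resample(X, y)
    print("after ADASYN:", Counter(y_ada))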
Down-sampling involves randomly removing observations from the majority class to prevent its signal from dominating the learning algorithm. The most common heuristic for doing so is resampling without replacement.
Cluster centroids. Cluster centroids is a method that replaces a cluster of majority class samples with the centroid found by a K-means algorithm, where the number of clusters is set by the desired level of undersampling.
Tomek links. Tomek links are pairs of instances of opposite classes that are each other's nearest neighbors. Tomek's algorithm looks for such pairs and removes the majority instance of each pair, reducing unwanted overlap between the classes until all minimally distanced nearest-neighbor pairs belong to the same class.
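The three undersampling approaches above have ready-made counterparts in imbalanced-learn; the sketch below (again, the library choice is an assumption, not something the text prescribes) applies each one to the same imbalanced dataset:

    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids, TomekLinks

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    print("original:", Counter(y))

    # Random down-sampling: drop majority examples without replacement.
    X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)
    print("random undersampling:", Counter(y_rus))

    # Cluster centroids: replace clusters of majority samples with K-means centroids.
    X_cc, y_cc = ClusterCentroids(random_state=0).fit_resample(X, y)
    print("cluster centroids:", Counter(y_cc))

    # Tomek links: remove the majority member of each cross-class nearest-neighbor pair.
    X_tl, y_tl = TomekLinks().fit_resample(X, y)
    print("Tomek links:", Counter(y_tl))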
At the algorithm level, or afterwards: adjust the class weights (misclassification costs), or adjust the decision threshold. Many machine learning toolkits have ways to adjust the "importance" of classes; in scikit-learn, many classifiers take an optional class_weight parameter.
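In scikit-learn, for instance, both knobs look roughly like this (the 0.3 threshold is purely illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # Adjust class weights: mistakes on the minority class cost more during training.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

    # Adjust the decision threshold: classify as positive above 0.3 instead of 0.5.
    proba = clf.predict_proba(X_test)[:, 1]
    y_pred = (proba >= 0.3).astype(int)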
Change the metric.
Evaluating the classifier: Accuracy is not a good metric for imbalanced classes!
Use a ROC curve.
Don't get hard classifications (labels) from your classifier (via predict). Instead, get probability estimates via predict_proba (or decision scores via decision_function).
No matter what you do for training, always test on the natural (stratified) distribution your classifier is going to operate upon. See sklearn.model_selection.StratifiedKFold (formerly sklearn.cross_validation.StratifiedKFold).
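A brief sketch of stratified cross-validation that scores on probability estimates rather than hard labels (dataset and model are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    clf = LogisticRegression(max_iter=1000)

    # StratifiedKFold preserves the natural class distribution in every test fold.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        proba = clf.predict_proba(X[test_idx])[:, 1]   # probabilities, not hard labels
        print("fold AUC:", roc_auc_score(y[test_idx], proba))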
For a single metric (value): AUC, F1 (harmonic mean of precision and recall), Cohen's Kappa (an evaluation statistic that takes into account how much agreement would be expected by chance).
The following performance measures can give more insight into the accuracy of the model than traditional classification accuracy (a short scikit-learn sketch follows the list):
Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned).
Precision: A measure of a classifier's exactness.
Recall: A measure of a classifier's completeness.
F1 Score (or F-score): The harmonic mean of precision and recall.
Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
ROC Curves: Like precision and recall, accuracy is split into sensitivity and specificity, and models can be chosen based on the trade-off between these values at different decision thresholds.
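All of these measures are available in scikit-learn; a short sketch computing them on a held-out test set (the dataset and classifier are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                                 f1_score, cohen_kappa_score, roc_auc_score, roc_curve)
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    proba = clf.predict_proba(X_test)[:, 1]

    print(confusion_matrix(y_test, y_pred))            # correct predictions on the diagonal
    print("precision:", precision_score(y_test, y_pred))
    print("recall:   ", recall_score(y_test, y_pred))
    print("F1:       ", f1_score(y_test, y_pred))
    print("kappa:    ", cohen_kappa_score(y_test, y_pred))
    print("ROC AUC:  ", roc_auc_score(y_test, proba))  # uses probabilities, not labels
    fpr, tpr, thresholds = roc_curve(y_test, proba)    # points for plotting the ROC curve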
Cost-Sensitive Training. For this tactic we use penalized learning algorithms that increase the cost of classification mistakes on the minority class. A popular algorithm for this technique is Penalized-SVM. During training, we can use the argument class_weight='balanced' to penalize mistakes on the minority class by an amount proportional to how under-represented it is.
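With scikit-learn's SVC this is essentially a one-argument change (the surrounding setup is illustrative):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # class_weight="balanced" scales the penalty C for each class inversely to its
    # frequency, so mistakes on the under-represented class are penalized more heavily.
    svm = SVC(kernel="linear", class_weight="balanced")
    svm.fit(X, y)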
The experiments show that bagging techniques generally outperform boosting; hence, in noisy data environments, bagging is the preferred method for handling class imbalance.