Scoring metrics for binary classifications
There are various terms about the scoring metrics about classification machine learning models. Let’s summarize them here!
Before start: Definition of symbols
- H0: null hypothesis, true negative samples follow this distribution
- H1: alternative hypothesis, true positive samples follow this distribution
- p0: p-value of H0
- p1: p-value of H1
Focus on actual positives
True Positive Rate (TPR): Recall, Sensitivity, Hit rate, or Efficiency
- TP/(TP+FN)
- Meaning: Out of real positive samples, how many of them are predicted correctly?
- Equal to 1-p1.
- Sensitivity in medical diagnosis: For a critical disease which you shouldn’t miss, its diagnosis should have high sensitivity.
- Efficiency in nuclear/particle physics experiments: The number of produced particles are limited due to expensive cost of operation. To reduce statistical uncertainties in measurements with given number of particles, i.e. to minimize the statistical error bar in your physics plot with limited budget, you should collect produced particles as much as you can. Therefore, the efficiency of detectors and data collection system is always incorporated at the experimental design.
False Negative Rate (FNR): Miss rate
- FN/(TP+FN) = 1-TPR
- Meaning: Out of real positive samples, how many of them are predicted incorrectly?
- Equal to p1.
- Equal to the probability of the type 2 error.
Focus on actual negatives
True Negative Rate (TNR): Specificity or Selectivity
- TN/(TN+FP)
- Meaning: Out of real negative samples, how many of them are predicted correctly?
- Equal to 1-p0.
- Medical diagnosis with high specificity means if you are negative, you are probably healthy. If specificity is 100%, the positive result means you definitely have the disease.
False Positive Rate (FPR): Fall-out or False alarm rate
- FP/(TN+FP) = 1-TNR
- Meaning: Out of real negative samples, how many of them are predicted incorrectly?
- Equal to p0.
- Equal to the probability of the type 1 error.
Focus on a classified sample
Positive Predictive Value (PPV): Precision or Purity
- TP/(TP+FP)
- Meaning: Out of positive predictions, how many of them are actually positive?
- PPV gets larger as the fraction of actual positive samples gets larger.
- Purity in nuclear/particle physics experiments: The typical bandwidth of data collection systems are always much lower than the flow of produced particle signals. Since most of particle signals create garbage data, every experiment has a particle filtering system, called a “trigger”. Along with the efficiency, the purity is another performance metric of a trigger. Higher purity means less waste of data collection resources. Each experiment comes with clever ideas of event selection logis to maximize the purity of the trigger while keeping the efficiency up to a desired level for target statistics.
Focus on correct/wrong rate
Accuracy
- (TP+TN)/(TP+TN+FP+FN)
- Meaning: Out of all samples, how many of them are correctly classified?
- Simple metric just to tell how much the machine is right
- Accuracy Paradox: Not a fair measure for imbalanced dataset.
Combination of metrics
F1 score
- 2*TPR*PPV/(TPR+PPV)
- Meaning: Harmonic mean of the precision and recall.
- Increasing the precision conflicts with increasing the recall. If one imposes a tight (loose) threshold to increase the precision (recall), one ends up losing (gaining) more positive (negative) samples, which means the decreased recall (precision). The F1 score provides the middle ground, by adjusting the threshold to maximize this value.
ROC curve and AUC
The ROC (Receiver operating characteristic) curve describes the classification power of a model with varied thresholds.
Figure is from Wikipedia.
- X axis: False Positive Rate
- Y axis: True Positive Rate
- Red dotted line (y=x): Random guess (saying positive with a certain probability regardless of actual class)
The integration of the ROC curve is called “AUC” (Area Under Curve). The AUC is used as a performance metric of a classifier.
Leave a comment