Scoring metrics for binary classification

April 27, 2022

Classification models in machine learning come with many scoring metrics, and the same metric often goes by several names. Let’s summarize them here!

Before we start: Definition of symbols

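H0: null hypothesis; actual negative samples follow this distribution
H1: alternative hypothesis; actual positive samples follow this distribution
p0: p-value under H0 at the decision threshold, i.e. the probability that an actual negative sample is classified as positive
p1: p-value under H1 at the decision threshold, i.e. the probability that an actual positive sample is classified as negative
TP, FN: number of actual positive samples predicted as positive (true positives) and as negative (false negatives)
TN, FP: number of actual negative samples predicted as negative (true negatives) and as positive (false positives)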
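As a concrete reading of p0 and p1, consider a classifier that outputs a score and calls a sample positive when the score is at or above a threshold. The sketch below assumes Gaussian score distributions (N(0, 1) under H0 and N(2, 1) under H1) and a threshold of 1. These numbers and the helper `normal_cdf` are illustrative assumptions, not part of the definitions above.

```python
import math


def normal_cdf(x, mean=0.0, sigma=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


# Assumed toy setup: scores of actual negatives follow N(0, 1) (H0),
# scores of actual positives follow N(2, 1) (H1), and a sample is
# called positive when its score is at or above the threshold.
threshold = 1.0

# p0: probability that an actual negative scores at or above the threshold.
p0 = 1.0 - normal_cdf(threshold, mean=0.0, sigma=1.0)
# p1: probability that an actual positive scores below the threshold.
p1 = normal_cdf(threshold, mean=2.0, sigma=1.0)

print(f"p0 = {p0:.4f}, p1 = {p1:.4f}")
```

Moving the threshold up makes p0 smaller and p1 larger, and vice versa, which is the trade-off the metrics below quantify.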

Focus on actual positives

True Positive Rate (TPR): Recall, Sensitivity, Hit rate, or Efficiency

TP/(TP+FN)
Meaning: Out of real positive samples, how many of them are predicted correctly?
Equal to 1-p1.
Sensitivity in medical diagnosis: A test for a critical disease that must not be missed should have high sensitivity.
Efficiency in nuclear/particle physics experiments: The number of produced particles is limited because operation is expensive. To reduce the statistical uncertainty of a measurement with a given number of particles, i.e. to minimize the statistical error bars in your physics plot on a limited budget, you should collect as many of the produced particles as you can. Therefore, the efficiency of the detectors and the data collection system is always taken into account in the experimental design.

False Negative Rate (FNR): Miss rate

FN/(TP+FN) = 1-TPR
Meaning: Out of real positive samples, how many of them are predicted incorrectly?
Equal to p1.
Equal to the probability of the type 2 error.

Focus on actual negatives

True Negative Rate (TNR): Specificity or Selectivity

TN/(TN+FP)
Meaning: Out of real negative samples, how many of them are predicted correctly?
Equal to 1-p0.
Medical diagnosis with high specificity means that if you are healthy, your result is probably negative. If the specificity is 100%, a positive result means you definitely have the disease.

False Positive Rate (FPR): Fall-out or False alarm rate

FP/(TN+FP) = 1-TNR
Meaning: Out of real negative samples, how many of them are predicted incorrectly?
Equal to p0.
Equal to the probability of the type 1 error.
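The four rates defined so far can be computed directly from the confusion counts. This is a minimal sketch in plain Python; the function names `confusion_counts` and `rates` and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, TN, FP for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def rates(tp, fn, tn, fp):
    """Return TPR, FNR, TNR, FPR.

    Assumes at least one actual positive and one actual negative sample.
    """
    return {
        "TPR": tp / (tp + fn),
        "FNR": fn / (tp + fn),
        "TNR": tn / (tn + fp),
        "FPR": fp / (tn + fp),
    }


# Toy data (made up): 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, fn, tn, fp = confusion_counts(y_true, y_pred)
r = rates(tp, fn, tn, fp)
print(tp, fn, tn, fp)
print(r)
```

Note that TPR + FNR = 1 and TNR + FPR = 1 by construction, since each pair shares the same denominator.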

Focus on a classified sample

Positive Predictive Value (PPV): Precision or Purity

TP/(TP+FP)
Meaning: Out of positive predictions, how many of them are actually positive?
For fixed TPR and FPR, PPV gets larger as the fraction of actual positive samples gets larger.
Purity in nuclear/particle physics experiments: The bandwidth of a data collection system is typically much lower than the rate of produced particle signals. Since most particle signals are garbage data, every experiment has a particle filtering system called a “trigger”. Along with the efficiency, the purity is another performance metric of a trigger. Higher purity means less waste of data collection resources. Each experiment comes up with clever event selection logic to maximize the purity of the trigger while keeping the efficiency at the level needed for the target statistics.
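The dependence of PPV on the fraction of actual positives follows from Bayes' theorem: PPV = TPR*f / (TPR*f + FPR*(1-f)), where f is the fraction of actual positive samples. The sketch below evaluates this for an assumed classifier with TPR = 0.9 and FPR = 0.1; the function name `ppv` and the numbers are illustrative.

```python
def ppv(tpr, fpr, positive_fraction):
    """PPV from TPR, FPR and the fraction of actual positive samples."""
    true_pos = tpr * positive_fraction
    false_pos = fpr * (1.0 - positive_fraction)
    return true_pos / (true_pos + false_pos)


# Assumed classifier with TPR = 0.9 and FPR = 0.1, applied to
# datasets with different fractions of actual positives.
for fraction in (0.01, 0.1, 0.5):
    print(f"positive fraction = {fraction}: PPV = {ppv(0.9, 0.1, fraction):.3f}")
```

With the same classifier, the PPV is below 10% when only 1% of the samples are positive and reaches 90% when half of them are. This is why a trigger with a good FPR can still have low purity when signal events are rare.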

Focus on correct/wrong rate

Accuracy

(TP+TN)/(TP+TN+FP+FN)
Meaning: Out of all samples, how many of them are correctly classified?
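Simple metric that just tells how often the model is right.
Accuracy paradox: Not a fair measure for an imbalanced dataset. A model that always predicts the majority class gets a high accuracy without detecting a single minority-class sample.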
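The accuracy paradox is easy to reproduce. The sketch below uses a made-up dataset with 1% positives and a "classifier" that always predicts negative.

```python
# Made-up imbalanced dataset: 990 actual negatives, 10 actual positives.
y_true = [0] * 990 + [1] * 10
# A useless "classifier" that always predicts negative.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
# prints: accuracy = 0.99, recall = 0.00
```

The accuracy looks excellent even though every actual positive sample is missed, so the recall is zero.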

Combination of metrics

F1 score

2*TPR*PPV/(TPR+PPV)
Meaning: Harmonic mean of the precision and recall.
Increasing the precision conflicts with increasing the recall. A tight threshold raises the precision but rejects more actual positive samples, which lowers the recall. A loose threshold raises the recall but accepts more actual negative samples, which lowers the precision. The F1 score provides a middle ground: one can adjust the threshold to maximize it.
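Choosing the threshold that maximizes the F1 score can be sketched as a simple scan over candidate thresholds. The scores, labels, and the function name `f1_at_threshold` below are made up for illustration, and a sample is assumed to be called positive when its score is at or above the threshold.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when samples with score >= threshold are called positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0  # convention: F1 is 0 when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


# Toy classifier scores and actual labels (made up).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1]

# Try every observed score as a threshold and keep the best one.
candidates = sorted(set(scores))
best_threshold = max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))
best_f1 = f1_at_threshold(scores, labels, best_threshold)
print(best_threshold, round(best_f1, 3))
```

On this toy data the tightest threshold gives perfect precision but poor recall, the loosest gives perfect recall but poor precision, and the best F1 lies in between.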

ROC curve and AUC

The ROC (receiver operating characteristic) curve shows the classification power of a model as its threshold is varied.

Figure is from Wikipedia.

X axis: False Positive Rate
Y axis: True Positive Rate
Red dotted line (y=x): Random guess (saying positive with a certain probability regardless of actual class)

The integral of the ROC curve, i.e. the area under it, is called the “AUC” (Area Under the Curve). The AUC is used as a performance metric of a classifier: 0.5 corresponds to random guessing and 1 to a perfect classifier.
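A minimal sketch of how the ROC curve and the AUC can be computed by hand: sweep the threshold from high to low, record (FPR, TPR) at each step, and integrate with the trapezoidal rule. The function names `roc_points` and `auc` and the toy data are illustrative assumptions, and the code assumes at least one actual positive and one actual negative sample.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy classifier scores and actual labels (made up).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1]

points = roc_points(scores, labels)
print(points)
print(f"AUC = {auc(points):.2f}")
```

The curve starts at (0, 0), where nothing is called positive, and ends at (1, 1), where everything is. The resulting AUC also equals the fraction of (positive, negative) pairs in which the positive sample has the higher score.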
