Performance Metrics
Purpose
How to choose, compute, and read model performance metrics with yardstick. Covers regression, binary classification, and multiclass classification, plus the common ways each goes wrong.
Maps to Tidy Modeling with R, chapter 9.
Choose Metrics by Scenario
Pick the metric set before fitting. Reverse engineering it later invites cherry-picking.
| Scenario | Report these |
|---|---|
| Regression | rmse + mae + rsq + rsq_trad + observed-vs-predicted scatter plot |
| Binary, balanced | accuracy + mcc + roc_auc |
| Binary, medical diagnosis | sens + spec + roc_auc + mcc |
| Binary, extreme imbalance | mcc + pr_auc |
| Multiclass | accuracy + macro_weighted roc_auc + per-class sens reported separately |
| Inference model | Whatever inferential statistic you reported, plus at least one held-out predictive metric (accuracy or RMSE) against the majority-class / mean baseline |
Default rule: report at least two metrics that measure different things. A single number always conceals something.
The Core Distinction: Accuracy vs Correlation
The most consequential statistical fact in this chapter:
- RMSE / MAE measure accuracy — how far predictions sit from the truth.
- R² measures correlation — whether predictions move with the truth.
A model whose predictions are systematically biased (e.g., always 50% of the true value) can have R² near 1 while RMSE is large. The predictions track the truth perfectly in shape, but every individual number is wrong. Conversely, a model with high RMSE but scattered errors can have a modest R².
This is why a single metric is never enough. Report at least one accuracy metric and one correlation metric, and check the observed-vs-predicted scatter plot to see whether the points follow the 45° identity line.
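The 50% example can be checked numerically. The sketch below uses simulated data (the uniform outcome and the variable names are assumptions made for illustration) and yardstick's vector interface:

```r
library(yardstick)

set.seed(1)
# Hypothetical outcome values, simulated for illustration only
truth <- runif(200, min = 10, max = 100)

# A deliberately biased "model": every prediction is half the true value
biased <- 0.5 * truth

rmse_biased     <- rmse_vec(truth, biased)      # accuracy: large
rsq_biased      <- rsq_vec(truth, biased)       # correlation: essentially 1
rsq_trad_biased <- rsq_trad_vec(truth, biased)  # 1 - SSE/SST: far below 1

c(rmse = rmse_biased, rsq = rsq_biased, rsq_trad = rsq_trad_biased)
```

The squared correlation is blind to the bias, while RMSE and the traditional R² both register it.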
Inference Models Also Need Predictive Fidelity
Even when a model is used to test relationships rather than to predict, report a held-out predictive metric alongside the p-values.
Example from the chapter: a logistic regression for Alzheimer’s status with significant two-way interactions reaches 73.4% accuracy on resampled data. The baseline rate of un-impaired patients is 72.7%. The model is less than 1 percentage point better than always predicting “un-impaired”.
Statistical significance does not imply practical fit. Predictive fidelity calibrates how much the inferential conclusions deserve trust.
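A baseline comparison of this kind takes only a few lines. The sketch below uses the `two_class_example` data shipped with yardstick as a stand-in (it is not the Alzheimer's data from the chapter) and compares model accuracy with the always-predict-the-majority-class rate:

```r
library(yardstick)
data("two_class_example", package = "yardstick")

# Accuracy of the stored hard predictions
model_acc <- accuracy_vec(two_class_example$truth, two_class_example$predicted)

# Accuracy of a "do nothing" rule that always predicts the majority class
baseline_acc <- max(prop.table(table(two_class_example$truth)))

c(model = model_acc, baseline = baseline_acc, gain = model_acc - baseline_acc)
```

The `gain` value, not the raw accuracy, is what indicates how much the model adds.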
Regression Metrics
| Metric | Formula intuition | Behavior |
|---|---|---|
| `rmse` | \(\sqrt{\overline{(y - \hat y)^2}}\) | Same units as outcome. Squares amplify outliers. |
| `mae` | \(\overline{\lvert y - \hat y \rvert}\) | Same units. Robust to outliers. |
| `rsq` | \(\text{cor}(y, \hat y)^2\) | Always in \([0, 1]\). Pure correlation. |
| `rsq_trad` | \(1 - \text{SSE}/\text{SST}\) | Can be negative (model worse than predicting the mean). |
| `ccc` | Concordance correlation | Penalizes both correlation loss and bias. |
| `mape` | Mean absolute percentage error | Scale-free, but undefined when truth is zero. |
Diagnostic pattern: if rsq is high and rsq_trad is much lower, predictions correlate with the truth but sit on a line other than the 45° identity line, which signals systematic bias. The scatter plot will show a tight cloud that is offset from the diagonal, tilted relative to it, or both.
No adjusted_rsq in yardstick. Adjusted R² exists to penalize degrees of freedom when the same data are used to fit and evaluate. Yardstick’s stance is that evaluation always happens on held-out data, so the adjustment is unnecessary.
Scale rule: if the model was fit on a transformed outcome (e.g., log10(Sale_Price)), compute metrics on the same transformed scale. Back-transforming first and then computing RMSE gives a different, harder-to-interpret quantity.
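To see how different the two quantities are, the following sketch simulates log10-scale outcomes and predictions (all values are hypothetical, chosen only to resemble house prices) and computes RMSE on both scales:

```r
library(yardstick)

set.seed(2)
# Hypothetical log10 sale prices and predictions made on the log10 scale
log_truth <- rnorm(100, mean = 5.2, sd = 0.2)
log_pred  <- log_truth + rnorm(100, sd = 0.05)

# RMSE on the modeled (log10) scale: unit is log10 dollars
rmse_log <- rmse_vec(log_truth, log_pred)

# RMSE after back-transforming both columns: unit is dollars,
# and expensive houses dominate the squared errors
rmse_raw <- rmse_vec(10^log_truth, 10^log_pred)

c(log10_scale = rmse_log, original_scale = rmse_raw)
```

The two numbers differ by orders of magnitude and weight errors differently, so a report needs to say which one it is quoting.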
Binary Classification Metrics
Hard predictions (class labels)
Built from the confusion matrix:
| | Predicted + | Predicted - |
|---|---|---|
| Actual + | TP | FN |
| Actual - | FP | TN |

- sensitivity = TP / (TP + FN)
- specificity = TN / (FP + TN)
- precision = TP / (TP + FP)
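As a worked check of these three formulas, the sketch below uses made-up cell counts (TP = 40, FN = 10, FP = 20, TN = 130 are assumptions for illustration), computes each ratio by hand, and confirms that yardstick agrees when the positive class is the first factor level:

```r
library(yardstick)

# Hypothetical confusion-matrix counts
TP <- 40; FN <- 10; FP <- 20; TN <- 130

sens_hand <- TP / (TP + FN)  # 0.8
spec_hand <- TN / (FP + TN)  # 130 / 150
prec_hand <- TP / (TP + FP)  # 40 / 60

# Rebuild label vectors with exactly those counts; "pos" is the first level
lv <- c("pos", "neg")
truth    <- factor(rep(lv, times = c(TP + FN, FP + TN)), levels = lv)
estimate <- factor(c(rep("pos", TP), rep("neg", FN),
                     rep("pos", FP), rep("neg", TN)), levels = lv)

c(sens = sens_vec(truth, estimate),
  spec = spec_vec(truth, estimate),
  precision = precision_vec(truth, estimate))
```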
| Metric | Use when |
|---|---|
| `accuracy` | Classes are balanced. Misleading otherwise. |
| `mcc` | General-purpose. Robust to imbalance. Uses all four cells. |
| `f_meas` | Need a single number that emphasizes the positive class. |
| `sens`, `spec` | Medical or risk-decision contexts. Report separately, not combined. |
| `precision`, `recall` | Information-retrieval contexts. |
| `conf_mat` | Returns the matrix object for direct inspection. |
Why accuracy fails on imbalance: 99% healthy / 1% disease. Predicting “all healthy” gives 99% accuracy but sensitivity 0 — every patient missed. Always supplement with MCC, F1, or class-specific metrics on imbalanced data.
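The 99% / 1% scenario is easy to reproduce. In this sketch the labels are constructed by hand (1,000 hypothetical patients, 10 with disease) and the "model" predicts healthy for everyone:

```r
library(yardstick)

lv <- c("disease", "healthy")  # first level = positive class

# 10 diseased and 990 healthy patients (hypothetical)
truth <- factor(rep(lv, times = c(10, 990)), levels = lv)

# A useless classifier that labels every patient healthy
all_healthy <- factor(rep("healthy", 1000), levels = lv)

acc_all  <- accuracy_vec(truth, all_healthy)  # 990 / 1000
sens_all <- sens_vec(truth, all_healthy)      # 0 / 10

c(accuracy = acc_all, sensitivity = sens_all)
```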
Why MCC deserves more use: it is one of the few single-number metrics that uses all four cells of the confusion matrix and stays meaningful under heavy imbalance. Range \([-1, 1]\), with 0 = random.
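For reference, MCC is the correlation between the true and predicted labels, computed from the four cells as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below applies that formula to made-up counts (assumptions for illustration) and compares the result with `mcc_vec()`:

```r
library(yardstick)

# Hypothetical confusion-matrix counts
TP <- 40; FN <- 10; FP <- 20; TN <- 130

# MCC from the four cells; the counts are stored as doubles,
# so the product in the denominator does not overflow
mcc_hand <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

# Label vectors with exactly those counts
lv <- c("pos", "neg")
truth    <- factor(rep(lv, times = c(TP + FN, FP + TN)), levels = lv)
estimate <- factor(c(rep("pos", TP), rep("neg", FN),
                     rep("pos", FP), rep("neg", TN)), levels = lv)

c(by_hand = mcc_hand, yardstick = mcc_vec(truth, estimate))  # both about 0.63
```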
Soft predictions (probabilities)
| Metric | Use when |
|---|---|
| `roc_auc` | Default for ranking-quality assessment. Probability that the model scores a random positive higher than a random negative. |
| `pr_auc` | Heavy imbalance: ROC AUC is inflated by the abundance of TN. PR AUC isolates positive-class performance. |
| `gain_curve`, `lift_curve` | Scoring / ranking applications (marketing, risk). |
| `roc_curve` | Returns the points for plotting. `autoplot()` available. |
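Both area metrics take the truth column plus the probability column of the event class. A minimal sketch on yardstick's bundled `two_class_example` data, which is roughly balanced, so the two values are not expected to diverge much here:

```r
library(yardstick)
data("two_class_example", package = "yardstick")

# Class1 is the first factor level, hence the default event;
# pass its probability column
roc_res <- roc_auc(two_class_example, truth, Class1)
pr_res  <- pr_auc(two_class_example, truth, Class1)

rbind(roc_res, pr_res)
```

On heavily imbalanced data the same two calls are where the gap between ROC AUC and PR AUC would become visible.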
event_level — confirm the positive class
Yardstick treats the first factor level as the positive class by default.
```r
f_meas(data, truth, predicted)                          # first level = positive
f_meas(data, truth, predicted, event_level = "second")  # second level = positive
```

This differs from scikit-learn, where the positive label defaults to 1 (in effect the second class), and from base R conventions that assume 0/1 encoding. If the factor is c("control", "case"), "control" is treated as the positive class by default, which is the wrong choice for almost every medical application. Always confirm or set event_level explicitly.
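The effect of the default is easiest to see on a tiny hand-made example. The eight labels below are hypothetical; the factor levels are ordered c("control", "case") as in the text:

```r
library(yardstick)

lv <- c("control", "case")  # "control" is the first level

truth <- factor(c("case", "case", "case",
                  "control", "control", "control", "control", "control"),
                levels = lv)
pred  <- factor(c("case", "case", "control",
                  "control", "control", "control", "control", "case"),
                levels = lv)

# Default: "control" is the event, so this is the fraction of controls
# correctly called control (4 of 5)
sens_default <- sens_vec(truth, pred)

# Explicit: "case" is the event, the fraction of cases detected (2 of 3)
sens_case <- sens_vec(truth, pred, event_level = "second")

c(default = sens_default, case_as_event = sens_case)
```

The same data give 0.8 or roughly 0.67 depending only on which level is treated as the event.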
Multiclass Classification Metrics
Binary metrics (sensitivity, precision, F1) need an averaging strategy to extend to three or more classes. Yardstick provides them via estimator:
| `estimator` | Mechanism | Use when |
|---|---|---|
| `"macro"` | Compute the metric one-vs-all for each class, then take the unweighted mean | All classes equally important |
| `"macro_weighted"` | Same, but the mean is weighted by class size | Want to reflect the data distribution |
| `"micro"` | Aggregate TP / FP / FN across classes, then compute one metric | Majority-class behavior dominates |
```r
sensitivity(data, obs, pred, estimator = "macro")
sensitivity(data, obs, pred, estimator = "macro_weighted")
sensitivity(data, obs, pred, estimator = "micro")
```

Multiclass roc_auc uses the Hand-Till extension by default. All class-probability columns must be passed:

```r
roc_auc(data, obs, VF, F, M, L)
roc_auc(data, obs, VF, F, M, L, estimator = "macro_weighted")
```

Reporting rule for multiclass: always state the estimator. “roc_auc = 0.85” is ambiguous without specifying hand_till vs macro vs macro_weighted (yardstick's roc_auc has no micro option; micro applies to class metrics such as sensitivity). Additionally report per-class sensitivity so the reader sees both the averaged and the class-specific picture.
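One way to produce the per-class numbers alongside the averaged ones is sketched below on yardstick's `hpc_cv` data. The per-class sensitivities are computed by hand in base R (an illustration, not a yardstick function); their unweighted mean should match the `"macro"` estimate:

```r
library(yardstick)
data("hpc_cv", package = "yardstick")

macro    <- sensitivity(hpc_cv, obs, pred, estimator = "macro")
weighted <- sensitivity(hpc_cv, obs, pred, estimator = "macro_weighted")
micro    <- sensitivity(hpc_cv, obs, pred, estimator = "micro")

# Per-class (one-vs-all) sensitivity: among rows whose true class is `cls`,
# the fraction that were predicted as `cls`
per_class <- sapply(levels(hpc_cv$obs), function(cls) {
  mean(hpc_cv$pred[hpc_cv$obs == cls] == cls)
})

per_class
c(macro = macro$.estimate,
  macro_weighted = weighted$.estimate,
  micro = micro$.estimate)
```

Printing `per_class` next to the averages shows which classes pull the summary up or down.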
Yardstick Interface
Function shape:
```r
metric_fn(data, truth, estimate)       # hard predictions
metric_fn(data, truth, prob_col, ...)  # soft predictions
```

Returns a tibble with .metric, .estimator, .estimate.
Combine multiple metrics into one callable:
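```r
my_metrics <- metric_set(rmse, rsq, mae)
my_metrics(preds, truth = y, estimate = .pred)
```

metric_set cannot mix regression (numeric) metrics with classification metrics in the same set. Class metrics and class-probability metrics can share one set: pass the hard-prediction column as the named argument estimate = and the probability columns unnamed.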
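A runnable version of the regression metric set, on simulated data (the data frame `preds` and its columns `y` and `.pred` are made up here to match the names used above), shows the shape of what comes back:

```r
library(yardstick)

set.seed(3)
# Hypothetical truth and predictions
preds <- data.frame(y = rnorm(50))
preds$.pred <- preds$y + rnorm(50, sd = 0.3)

my_metrics <- metric_set(rmse, rsq, mae)
res <- my_metrics(preds, truth = y, estimate = .pred)

res         # one row per metric
names(res)  # ".metric" ".estimator" ".estimate"
```

Because every metric returns the same three columns, results from different models stack cleanly with `rbind()` or `dplyr::bind_rows()`.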
Works under group_by() — pass a grouped tibble and metrics are computed per group:
```r
hpc_cv |>
  group_by(Resample) |>
  accuracy(obs, pred)
```

CV-fold-wise metrics are one pipe away.
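From the per-fold tibble, a resampling summary is one more step. The sketch below assumes dplyr is installed; reporting the mean with a standard error across folds is one common choice, not something yardstick prescribes:

```r
library(yardstick)
library(dplyr)
data("hpc_cv", package = "yardstick")

# One accuracy estimate per resample
fold_acc <- hpc_cv |>
  group_by(Resample) |>
  accuracy(obs, pred) |>
  ungroup()

# Collapse the folds into a mean and its standard error
fold_summary <- fold_acc |>
  summarise(mean_acc = mean(.estimate),
            std_err  = sd(.estimate) / sqrt(n()))

fold_summary
```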
Reading Checks
When reading a results section that reports model performance, ask:
- Are at least one accuracy metric and one correlation metric reported (regression)? Or at least one hard-prediction and one soft-prediction metric (classification)?
- On imbalanced binary data, is anything other than accuracy reported? If not, the headline number is uninformative.
- For multiclass, is the averaging `estimator` stated? Without it, the number cannot be interpreted.
- For binary, is `event_level` consistent with the analyst’s intended positive class?
- Was the metric computed on held-out data (resampled or test set), or on training data?
- For inference models, is any predictive metric reported alongside p-values?
- Is the baseline rate (majority class / mean prediction) given so the model can be compared to “do nothing”?
- If predictions were back-transformed before metric computation (e.g., from log scale to original), is that disclosed?
Common Pitfalls
- Reporting only `accuracy` on imbalanced binary data.
- Reporting `roc_auc` on extreme imbalance; use `pr_auc`.
- Defaulting `event_level` to the first factor level when the analyst meant the second.
- Multiclass metrics without naming the `estimator`.
- A single regression metric with no scatter plot; systematic bias hides behind high R².
- Comparing across studies with different averaging strategies (macro vs micro) without converting.
- Computing metrics on the training data and calling that “model performance”.
- Inference model with significant p-values but no predictive fidelity check against baseline.
- Back-transforming log predictions to the original scale and then computing RMSE — the resulting number is harder to interpret and not comparable to RMSE on the modeled scale.
Note
The metric question is two questions at once: statistical (does this number measure what I want?) and methodological (is the number computed on data the model was not fit to?). The first depends on the problem; the second is non-negotiable.
For my own reading, the most useful habit is to refuse the headline number. Demand two metrics and a baseline. If a paper reports R² = 0.82 with no RMSE and no scatter plot, I do not yet know whether the model is well-calibrated or just systematically biased in the same direction. If a paper reports accuracy = 0.92 with no class balance and no MCC, I do not yet know whether the model is good or whether the majority class is 91% of the data.
For binary medical work specifically — the scenario most relevant to biomedical research — sens and spec reported separately are far more informative than any single composite. A composite (F1, accuracy) hides which type of error dominates, and the cost asymmetry between the two error types is the heart of every clinical decision.
Sources
- Kuhn & Silge, Tidy Modeling with R, Chapter 9: Judging Model Effectiveness