Performance Metrics

tidymodels
Published: May 29, 2026

Purpose

How to choose, compute, and read model performance metrics with yardstick. Covers regression, binary classification, and multiclass classification, plus the common ways each goes wrong.

Maps to Tidy Modeling with R, chapter 9.

Choose Metrics by Scenario

Pick the metric set before fitting. Reverse engineering it later invites cherry-picking.

| Scenario | Report these |
| --- | --- |
| Regression | rmse + mae + rsq + rsq_trad + observed-vs-predicted scatter plot |
| Binary, balanced | accuracy + mcc + roc_auc |
| Binary, medical diagnosis | sens + spec + roc_auc + mcc |
| Binary, extreme imbalance | mcc + pr_auc |
| Multiclass | accuracy + macro_weighted roc_auc + per-class sens reported separately |
| Inference model | Whatever inferential statistic you reported, plus at least one held-out predictive metric (accuracy or RMSE) compared against the majority-class or mean baseline |

Default rule: report at least two metrics that measure different things. A single number always conceals something.

The Core Distinction: Accuracy vs Correlation

The most consequential statistical fact in this chapter:

  • RMSE / MAE measure accuracy: how far predictions sit from the truth.
  • R² measures correlation: whether predictions move with the truth.

A model whose predictions are systematically biased (e.g., always 50% of the true value) can have R² near 1 while RMSE is large. The predictions track the shape of the truth perfectly, but every individual number is wrong. By contrast, a model whose errors are scattered rather than systematic shows a large RMSE together with a modest R².

This is why a single metric is never enough. Report at least one accuracy metric and one correlation metric, and inspect the observed-vs-predicted scatter plot to check that the points fall along the identity line (the 45° diagonal).
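The biased-prediction case can be sketched with a few made-up numbers and yardstick's vector interface (rmse_vec(), rsq_vec()). The values of truth and pred are illustrative assumptions, not data from the chapter.

```r
library(yardstick)

# Hypothetical outcome values; every prediction is exactly 50% of the truth.
truth <- c(10, 20, 30, 40, 50)
pred  <- 0.5 * truth

acc_rmse <- rmse_vec(truth, pred)  # accuracy: typical distance from the truth
cor_rsq  <- rsq_vec(truth, pred)   # correlation: squared cor(truth, pred)

acc_rmse  # about 16.6, large for outcomes that range from 10 to 50
cor_rsq   # 1 (up to floating-point error), despite the bias
```

Read together, the two numbers flag the problem; either one alone would not.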

Inference Models Also Need Predictive Fidelity

Even when a model is used to test relationships rather than to predict, report a held-out predictive metric alongside the p-values.

Example from the chapter: a logistic regression for Alzheimer’s status with significant two-way interactions reaches 73.4% accuracy on resampled data. The baseline rate of unimpaired patients is 72.7%. The model is less than 1 percentage point better than always predicting “unimpaired”.

Statistical significance does not imply practical fit. Predictive fidelity calibrates how much the inferential conclusions deserve trust.
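A minimal sketch of the baseline comparison in base R. The outcome vector and the model accuracy are hypothetical values chosen for illustration; they are not the Alzheimer's data.

```r
# Hypothetical held-out outcomes: 7 unimpaired, 3 impaired.
truth <- factor(c(rep("unimpaired", 7), rep("impaired", 3)))

# Accuracy of the "do nothing" rule: always predict the most common class.
class_counts <- table(truth)
baseline_acc <- max(class_counts) / sum(class_counts)

# Hypothetical resampled accuracy of the fitted model.
model_acc <- 0.75

# Improvement over the baseline, as a proportion.
gain <- model_acc - baseline_acc

baseline_acc  # 0.7
gain          # 0.05, i.e. 5 percentage points
```

The gain over the baseline, not the raw accuracy, is the number to weigh against the inferential claims.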

Regression Metrics

| Metric | Formula intuition | Behavior |
| --- | --- | --- |
| rmse | \(\sqrt{\overline{(y - \hat y)^2}}\) | Same units as outcome. Squares amplify outliers. |
| mae | \(\overline{\lvert y - \hat y \rvert}\) | Same units. Robust to outliers. |
| rsq | \(\text{cor}(y, \hat y)^2\) | Always in \([0, 1]\). Pure correlation. |
| rsq_trad | \(1 - \text{SSE}/\text{SST}\) | Can be negative (model worse than predicting the mean). |
| ccc | Concordance correlation | Penalizes both correlation loss and bias. |
| mape | Mean absolute percentage error | Scale-free, but undefined when truth is zero. |

Diagnostic pattern: if rsq is high and rsq_trad is much lower, predictions correlate with the truth but fall along a line other than the 45° identity line, which indicates systematic bias. The scatter plot will show a tight cloud that is offset from the diagonal, tilted relative to it, or both.
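The rsq versus rsq_trad gap can be reproduced with a constant offset. This is a sketch on invented numbers, assuming yardstick's rsq_vec() and rsq_trad_vec().

```r
library(yardstick)

# Hypothetical outcomes; every prediction is 15 units too high.
truth <- c(10, 20, 30, 40, 50)
pred  <- truth + 15

r2_cor  <- rsq_vec(truth, pred)       # squared correlation
r2_trad <- rsq_trad_vec(truth, pred)  # 1 - SSE / SST

r2_cor   # 1 (up to floating-point error): the cloud is a perfect line
r2_trad  # -0.125: SSE = 5 * 15^2 = 1125 exceeds SST = 1000
```

A negative rsq_trad here says the offset predictions do worse than predicting the mean of the truth for every row.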

No adjusted_rsq in yardstick. Adjusted R² exists to penalize degrees of freedom when the same data are used to fit and evaluate. Yardstick’s stance is that evaluation always happens on held-out data, so the adjustment is unnecessary.

Scale rule: if the model was fit on a transformed outcome (e.g., log10(Sale_Price)), compute metrics on the same transformed scale. Back-transforming first and then computing RMSE gives a different, harder-to-interpret quantity.
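To see that the two scales give different quantities, here is a small sketch with invented sale prices. The column names .pred_log, Sale_Price_log, and .pred_dollars are hypothetical.

```r
library(yardstick)

# Hypothetical prices and predictions from a model fit to log10(Sale_Price).
preds <- data.frame(
  Sale_Price = c(120000, 185000, 250000, 410000),
  .pred_log  = c(5.10, 5.25, 5.43, 5.58)
)
preds$Sale_Price_log <- log10(preds$Sale_Price)
preds$.pred_dollars  <- 10^preds$.pred_log

# RMSE on the modeled (log10) scale: the quantity the model was trained on.
rmse_log <- rmse(preds, truth = Sale_Price_log, estimate = .pred_log)

# RMSE after back-transforming: in dollars, a different quantity in which
# errors on expensive houses carry more weight.
rmse_dollars <- rmse(preds, truth = Sale_Price, estimate = .pred_dollars)

rmse_log
rmse_dollars
```

If the dollar-scale figure is reported, it should be labeled as such and not compared with log-scale RMSE values.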

Binary Classification Metrics

Hard predictions (class labels)

Built from the confusion matrix:

              predicted
            +        -
actual +   TP       FN     sensitivity = TP / (TP + FN)
       -   FP       TN     specificity = TN / (FP + TN)
                           precision   = TP / (TP + FP)

| Metric | Use when |
| --- | --- |
| accuracy | Classes are balanced. Misleading otherwise. |
| mcc | General-purpose. Robust to imbalance. Uses all four cells. |
| f_meas | Need a single number that emphasizes the positive class. |
| sens, spec | Medical or risk-decision contexts. Report separately, not combined. |
| precision, recall | Information-retrieval contexts. |
| conf_mat | Returns the matrix object for direct inspection. |

Why accuracy fails on imbalance: suppose 99% of patients are healthy and 1% have the disease. Predicting “all healthy” gives 99% accuracy but a sensitivity of 0, so every diseased patient is missed. Always supplement accuracy with MCC, F1, or class-specific metrics on imbalanced data.
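The 99/1 scenario can be checked directly. This sketch uses invented labels and yardstick's accuracy_vec() and sens_vec(), with "disease" placed first so that it is the event level.

```r
library(yardstick)

lvls  <- c("disease", "healthy")  # first level = event (yardstick default)
truth <- factor(c("disease", rep("healthy", 99)), levels = lvls)
pred  <- factor(rep("healthy", 100), levels = lvls)  # "all healthy" rule

acc  <- accuracy_vec(truth, pred)
sens <- sens_vec(truth, pred)

acc   # 0.99
sens  # 0: the single diseased patient is missed
```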

Why MCC deserves more use: it summarizes all four cells of the confusion matrix in a single number and stays meaningful under heavy imbalance, whereas accuracy breaks down under imbalance and F1 ignores true negatives. Range \([-1, 1]\), with 0 = random.
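As a worked illustration, MCC can be computed from the four cells with the standard formula. The helper mcc_from_cells() and the cell counts are hypothetical, written in base R for transparency; yardstick's mcc() is what one would use in practice.

```r
# MCC from confusion-matrix cells (standard definition).
# Returns NaN when any row or column of the matrix sums to zero.
mcc_from_cells <- function(tp, fn, fp, tn) {
  num <- tp * tn - fp * fn
  den <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  num / den
}

# Hypothetical screening result: 10 true positives among 1000 cases.
tp <- 5; fn <- 5; fp <- 5; tn <- 985

acc_value <- (tp + tn) / (tp + fn + fp + tn)
mcc_value <- mcc_from_cells(tp, fn, fp, tn)

acc_value  # 0.99
mcc_value  # 4900 / 9900, about 0.495
```

Accuracy looks near-perfect, while MCC reflects that half of the positives were missed and half of the positive calls were wrong.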

Soft predictions (probabilities)

| Metric | Use when |
| --- | --- |
| roc_auc | Default for ranking-quality assessment. Probability that the model scores a random positive higher than a random negative. |
| pr_auc | Heavy imbalance: ROC AUC is inflated by the abundance of TN. PR AUC isolates positive-class performance. |
| gain_curve, lift_curve | Scoring / ranking applications (marketing, risk). |
| roc_curve | Returns the points for plotting. autoplot() available. |
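The ranking interpretation of roc_auc can be checked on a toy example by comparing every positive score with every negative score. The labels and scores below are invented, and there are no tied scores, so a plain ">" comparison is enough for this sketch.

```r
library(yardstick)

# Hypothetical labels and predicted probabilities of the "pos" class.
truth <- factor(c("pos", "pos", "pos", "neg", "neg", "neg"),
                levels = c("pos", "neg"))  # first level = event
score <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.2)

# Fraction of (positive, negative) pairs in which the positive scores higher.
pos_scores <- score[truth == "pos"]
neg_scores <- score[truth == "neg"]
pairwise   <- mean(outer(pos_scores, neg_scores, ">"))

auc <- roc_auc_vec(truth, score)

pairwise  # 8 of 9 pairs are ordered correctly, about 0.889
auc       # should match the pairwise fraction
```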

event_level — confirm the positive class

Yardstick treats the first factor level as the positive class by default.

f_meas(data, truth, predicted)                        # first level = positive
f_meas(data, truth, predicted, event_level = "second") # second level = positive

This differs from scikit-learn, where pos_label defaults to 1, and from base R's glm(), which models the second factor level as the event. If the factor levels are c("control", "case"), yardstick treats "control" as the positive class by default, which is the wrong choice for almost every medical application. Always confirm the level order or set event_level explicitly.
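A small sketch of how much the choice matters, using invented case/control labels and yardstick's sens_vec():

```r
library(yardstick)

lvls  <- c("control", "case")  # "control" is the first level
truth <- factor(c(rep("control", 6), rep("case", 4)), levels = lvls)
pred  <- factor(c(rep("control", 5), "case",      # 5 of 6 controls correct
                  "case", rep("control", 3)),     # 1 of 4 cases correct
                levels = lvls)

sens_default <- sens_vec(truth, pred)                          # event = "control"
sens_case    <- sens_vec(truth, pred, event_level = "second")  # event = "case"

sens_default  # 5/6, about 0.833
sens_case     # 1/4 = 0.25
```

Same predictions, same function, very different "sensitivity": the default reports how well controls are recognized, not how well cases are detected.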

Multiclass Classification Metrics

Binary metrics (sensitivity, precision, F1) need an averaging strategy to extend to three or more classes. Yardstick provides them via estimator:

| estimator | Mechanism | Use when |
| --- | --- | --- |
| "macro" | Compute the metric one-vs-all for each class, then take the unweighted mean | All classes equally important |
| "macro_weighted" | Same, but the mean is weighted by class size | Want to reflect the data distribution |
| "micro" | Aggregate TP / FP / FN across classes, then compute one metric | Majority-class behavior dominates |

sensitivity(data, obs, pred, estimator = "macro")
sensitivity(data, obs, pred, estimator = "macro_weighted")
sensitivity(data, obs, pred, estimator = "micro")
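To make the three averaging schemes concrete, the sketch below rebuilds them by hand from the confusion table of hpc_cv (the example data shipped with yardstick) and keeps the per-class values visible. The variable names are my own.

```r
library(yardstick)

# Confusion table: rows = observed class, columns = predicted class.
cm <- table(truth = hpc_cv$obs, pred = hpc_cv$pred)

# One-vs-all sensitivity per class: correct predictions / class size.
per_class <- diag(cm) / rowSums(cm)

macro          <- mean(per_class)                       # unweighted mean
macro_weighted <- weighted.mean(per_class, rowSums(cm)) # weighted by class size
micro          <- sum(diag(cm)) / sum(cm)               # pooled TP / (TP + FN)

per_class
c(macro = macro, macro_weighted = macro_weighted, micro = micro)

# For sensitivity specifically, the class-size weights cancel, so the
# macro_weighted and micro values coincide; other metrics need not agree.
```

Printing per_class alongside the averages shows which classes pull the macro figure down.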

Multiclass roc_auc uses the Hand-Till extension. All class-probability columns must be passed:

roc_auc(data, obs, VF, F, M, L)
roc_auc(data, obs, VF, F, M, L, estimator = "macro_weighted")

Reporting rule for multiclass: always state the estimator. “roc_auc = 0.85” is ambiguous without specifying hand_till vs macro vs macro_weighted, and an averaged sensitivity is ambiguous without specifying macro vs macro_weighted vs micro. Additionally report per-class sensitivity so the reader sees both the averaged and the class-specific picture.

Yardstick Interface

Function shape:

metric_fn(data, truth, estimate)         # hard predictions
metric_fn(data, truth, prob_col, ...)    # soft predictions

Returns a tibble with .metric, .estimator, .estimate.

Combine multiple metrics into one callable:

my_metrics <- metric_set(rmse, rsq, mae)
my_metrics(preds, truth = y, estimate = .pred)
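A runnable version of the same pattern, using solubility_test (example data shipped with yardstick, with columns solubility and prediction). The name reg_metrics is arbitrary.

```r
library(yardstick)

# Bundle three regression metrics into one callable.
reg_metrics <- metric_set(rmse, rsq, mae)

# One call returns one row per metric.
res <- reg_metrics(solubility_test, truth = solubility, estimate = prediction)

res         # tibble with columns .metric, .estimator, .estimate
names(res)
```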

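metric_set() cannot mix regression metrics with classification metrics in the same set. Class (hard-prediction) and class-probability (soft-prediction) metrics can be combined, but their inputs differ: name the hard-class column with estimate = and pass the probability columns unnamed. Make two sets if that call shape is inconvenient.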

Works under group_by() — pass a grouped tibble and metrics are computed per group:

hpc_cv |>
  group_by(Resample) |>
  accuracy(obs, pred)

CV-fold-wise metrics are one pipe away.
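Building on the grouped call, the fold-wise values can be collapsed into a mean and spread with dplyr. This is a sketch on hpc_cv; the summary column names are my own.

```r
library(yardstick)
library(dplyr)

# One accuracy value per resample.
fold_acc <- hpc_cv |>
  group_by(Resample) |>
  accuracy(obs, pred)

# Summarize across folds: center, spread, and number of folds.
fold_summary <- fold_acc |>
  summarise(
    mean_accuracy = mean(.estimate),
    sd_accuracy   = sd(.estimate),
    n_folds       = n()
  )

fold_acc
fold_summary
```

Reporting the spread next to the mean shows how stable the metric is across resamples.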

Reading Checks

When reading a results section that reports model performance, ask:

  1. Are at least one accuracy metric and one correlation metric reported (regression)? Or at least one hard-prediction and one soft-prediction metric (classification)?
  2. On imbalanced binary data, is anything other than accuracy reported? If not, the headline number is uninformative.
  3. For multiclass, is the averaging estimator stated? Without it, the number cannot be interpreted.
  4. For binary, is event_level consistent with the analyst’s intended positive class?
  5. Was the metric computed on held-out data (resampled or test set), or on training data?
  6. For inference models, is any predictive metric reported alongside p-values?
  7. Is the baseline rate (majority class / mean prediction) given so the model can be compared to “do nothing”?
  8. If predictions were back-transformed before metric computation (e.g., from log scale to original), is that disclosed?

Common Pitfalls

  • Reporting only accuracy on imbalanced binary data.
  • Reporting only roc_auc under extreme imbalance; report pr_auc as well.
  • Defaulting event_level to the first factor level when the analyst meant the second.
  • Multiclass metrics without naming the estimator.
  • A single regression metric with no scatter plot — systematic bias hides behind high R².
  • Comparing across studies that use different averaging strategies (macro vs micro) as if the numbers were on the same footing.
  • Computing metrics on the training data and calling that “model performance”.
  • Inference model with significant p-values but no predictive fidelity check against baseline.
  • Back-transforming log predictions to the original scale and then computing RMSE — the resulting number is harder to interpret and not comparable to RMSE on the modeled scale.

Note

The metric question is two questions at once: statistical (does this number measure what I want?) and methodological (is the number computed on data the model was not fit to?). The first depends on the problem; the second is non-negotiable.

For my own reading, the most useful habit is to refuse the headline number. Demand two metrics and a baseline. If a paper reports R² = 0.82 with no RMSE and no scatter plot, I do not yet know whether the model is well-calibrated or just systematically biased in the same direction. If a paper reports accuracy = 0.92 with no class balance and no MCC, I do not yet know whether the model is good or whether the majority class is 91% of the data.

For binary medical work specifically — the scenario most relevant to biomedical research — sens and spec reported separately are far more informative than any single composite. A composite (F1, accuracy) hides which type of error dominates, and the cost asymmetry between the two error types is the heart of every clinical decision.

Sources

  • Kuhn & Silge, Tidy Modeling with R, Chapter 9: Judging Model Effectiveness