Data Spending

tidymodels

Published: May 28, 2026

Purpose

Data spending is the idea that the available data in a modeling project are a finite budget.

The same dataset may be needed for several different jobs:

estimating model parameters
choosing a model
tuning hyperparameters
assessing final model performance

If the same observations are reused for all of these jobs without discipline, the model can look better than it really is. The risk is not only overfitting, but also biased performance estimates and subtle information leakage.

The practical question is:

How should I allocate the data so model development and final evaluation stay separate?

Training Set

The training set is the data used for model development.

It is the place to:

fit models
estimate parameters
try feature engineering strategies
compare candidate models
tune preprocessing choices
tune model hyperparameters

So the training set is not just where a model is fit once. It is the main workspace for building the model.

training set = model building data

Any decision that changes the model should be made using training data, or using resampling procedures inside the training data.

Test Set

The test set has a different role. It is held back until the end and used to estimate how the chosen model performs on new data.

test set = final performance check

The test set should not be used to choose the model, tune parameters, select features, or revise preprocessing.

The rule is not that code may touch the test set only once. The rule is:

Do not use test set results to go back and change the model.

Once the test set influences model development, it is no longer an independent estimate of performance.

A Basic Split

For the Ames housing data, the book starts with an 80/20 split:

library(tidymodels)
tidymodels_prefer()

set.seed(501)

ames_split <- initial_split(ames, prop = 0.80)
ames_split

Here:

prop = 0.80 sends 80% of rows to the training set
the remaining 20% become the test set
set.seed() makes the random split reproducible

The object ames_split is not a data frame. It is an rsplit object that stores the partition information.

To extract the actual data frames:

ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

Both data frames have the same columns as ames, but different rows.
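
A quick sanity check on a split can be sketched as follows. The toy data frame and its row_id column are hypothetical stand-ins, not part of the Ames example; the point is only to confirm that the two sets partition the rows and share the same columns.

```r
library(rsample)

# Hypothetical toy data: 200 rows with an explicit row identifier.
set.seed(1)
toy <- data.frame(row_id = 1:200, x = rnorm(200))

toy_split <- initial_split(toy, prop = 0.80)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Every row should land in exactly one of the two sets.
n_total  <- nrow(toy_train) + nrow(toy_test)
n_shared <- length(intersect(toy_train$row_id, toy_test$row_id))

c(train = nrow(toy_train), test = nrow(toy_test), shared = n_shared)
```

The same check can be run on ames_train and ames_test if the data carry a unique row identifier.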

Why Random Splitting Is Not Always Enough

Simple random splitting assumes that the training set and test set will be similar enough by chance.

This can fail when important parts of the outcome distribution are rare.

In classification, this often appears as class imbalance. For example, if the positive class is only 5% of the data, a random split may put too few positive cases in the training set or test set.

In regression, the same issue can happen when the outcome is strongly skewed. Rare high or low outcome values may be poorly represented in one split.
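
To get a feel for how much chance matters, here is a small base R simulation. The numbers (1,000 rows, 5% positives, 2,000 repeated splits) are assumptions chosen for illustration, not values from the book.

```r
# Hypothetical setup: 1,000 rows with a 5% positive class (50 positives).
set.seed(123)
n <- 1000
y <- rep(c("pos", "neg"), times = c(50, 950))

# Repeat an unstratified 80/20 split many times and count the positives
# that end up in the 20% test set each time.
n_test <- 200
pos_in_test <- replicate(2000, {
  test_rows <- sample(n, size = n_test)
  sum(y[test_rows] == "pos")
})

# A perfectly proportional split would put 10 positives in every test set.
range(pos_in_test)
mean(pos_in_test)
```

The average is near the proportional count, but individual splits can land well above or below it, which is the variation that stratification is meant to reduce.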

Stratified Splitting

Stratified splitting tries to preserve the outcome distribution across the training and test sets.

For classification, the split is performed within each outcome class.

For regression, the continuous outcome can be binned into groups, such as quartiles, and the split is performed within those bins.

The Ames outcome, Sale_Price, is right-skewed. There are more inexpensive houses than expensive houses. If expensive houses are underrepresented in the training set, the model may learn that part of the price range poorly.

The book therefore uses:

set.seed(502)

ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

The key argument is:

strata = Sale_Price

This asks rsample to keep the distribution of Sale_Price more similar between training and testing.

There is usually little downside to stratified splitting. It is often a good default when the outcome is imbalanced or skewed.

One limitation is that rsample::initial_split() can stratify using only one column.
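
The effect of stratifying on a numeric outcome can be checked with a sketch like the one below. The simulated price column is a hypothetical stand-in for a skewed outcome such as Sale_Price, and the quartile bookkeeping is only one way to inspect the result.

```r
library(rsample)

# Hypothetical right-skewed outcome, loosely imitating house prices.
set.seed(2)
toy <- data.frame(price = rlnorm(1000, meanlog = 12, sdlog = 0.5))

# Stratify on the numeric outcome. rsample bins it (4 bins by default,
# controlled by the `breaks` argument) and samples within each bin.
toy_split <- initial_split(toy, prop = 0.80, strata = price)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Label each row with its outcome quartile, then compute the share of each
# quartile that went to training. Each share should be close to 0.80.
cuts <- quantile(toy$price, probs = seq(0, 1, by = 0.25))
quartile_of <- function(x) cut(x, breaks = cuts, include.lowest = TRUE)

train_share <- table(quartile_of(toy_train$price)) / table(quartile_of(toy$price))
round(train_share, 2)
```

Running the same check without the strata argument is a simple way to see how much the per-quartile shares drift under plain random splitting.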

Time-Ordered Data

Random splitting is not appropriate when time order matters.

For time series or other time-dependent data, the realistic task is often:

use earlier data to predict later data

Randomly placing future observations into the training set can make the evaluation too optimistic.

For this situation, rsample provides:

initial_time_split()

This function assumes that the data have already been sorted in the right time order. The first part becomes the training set, and the later part becomes the test set.
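
A minimal sketch of a time-ordered split, using a made-up daily series (the daily data frame and its columns are assumptions for illustration):

```r
library(rsample)

# Hypothetical daily series; the rows are sorted by date before splitting.
set.seed(3)
daily <- data.frame(
  date  = seq(as.Date("2024-01-01"), by = "day", length.out = 100),
  value = cumsum(rnorm(100))
)
daily <- daily[order(daily$date), ]

# The first 80% of rows become training data, the rest become test data.
time_split <- initial_time_split(daily, prop = 0.80)
time_train <- training(time_split)
time_test  <- testing(time_split)

# Check that all training dates precede all test dates.
max(time_train$date) < min(time_test$date)
```

The explicit sort is the important step: if the rows are not in time order, the split is still taken by row position, not by date.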

How Much Data Goes Where?

There is no universal split proportion.

If the training set is too small:

parameter estimates may be poor
the model may not learn enough structure
rare patterns may be missed

If the test set is too small:

performance estimates become unstable
metrics have high uncertainty
the final assessment is less trustworthy

An 80/20 split is a common starting point, not a rule. The right split depends on the sample size, task difficulty, outcome rarity, and cost of uncertain performance estimates.

A test set should be avoided only when the data are pathologically small. Otherwise, an unbiased final assessment is usually worth the data cost.
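
One rough way to gauge whether a test set is large enough is the binomial standard error of an accuracy estimate. This is a simplification that assumes independent test rows and a classification metric; the accuracy value of 0.85 and the test set sizes are hypothetical.

```r
# Standard error of an accuracy estimate under a simple binomial model:
# sqrt(p * (1 - p) / n), where n is the number of test rows.
accuracy_se <- function(p, n) sqrt(p * (1 - p) / n)

# Hypothetical scenario: true accuracy of 0.85, several test set sizes.
test_sizes <- c(50, 200, 1000, 5000)
se <- accuracy_se(p = 0.85, n = test_sizes)

data.frame(
  n_test               = test_sizes,
  std_error            = round(se, 3),
  approx_95_half_width = round(1.96 * se, 3)
)
```

Because the standard error shrinks with the square root of n, quadrupling the test set only halves the uncertainty.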

Validation Set

The test set is held back for final assessment. This creates a practical question:

How do I choose or tune a model before looking at the test set?

One answer is a validation set.

A validation set is used during model development to estimate performance before the final test set is touched. It is common in neural network and deep learning workflows, especially for monitoring overfitting and early stopping.

For example:

training error keeps decreasing
validation error starts increasing
the model is probably beginning to overfit

In rsample, a three-way split can be created with:

set.seed(52)

ames_val_split <- initial_validation_split(ames, prop = c(0.6, 0.2))
ames_val_split

This means:

60% training
20% validation
20% testing

The data can be extracted with:

ames_train <- training(ames_val_split)
ames_val   <- validation(ames_val_split)
ames_test  <- testing(ames_val_split)

In tidymodels, validation sets are closely related to resampling methods that operate inside the training set. Later workflow steps use these methods to tune and compare models without touching the test set.
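
As a sketch of that relationship, a three-way split can be wrapped into a resampling object with validation_set(). This assumes a recent rsample version (1.2.0 or later, where validation_set() is available); the toy data are hypothetical.

```r
library(rsample)

# Hypothetical toy data standing in for a real modeling data set.
set.seed(4)
toy <- data.frame(x = rnorm(500), y = rnorm(500))

# 60% training, 20% validation, 20% testing.
toy_val_split <- initial_validation_split(toy, prop = c(0.6, 0.2))

# Wrap the training and validation portions as a one-row resampling object.
# The test portion is not part of it.
toy_val_set <- validation_set(toy_val_split)
toy_val_set

# The single resample: analysis = training rows, assessment = validation rows.
val_resample <- toy_val_set$splits[[1]]
c(
  analysis   = nrow(analysis(val_resample)),
  assessment = nrow(assessment(val_resample)),
  test       = nrow(testing(toy_val_split))
)
```

An object of this kind can be passed to tuning functions in the same way as other resampling objects, which keeps the test rows out of the tuning loop.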

Multilevel Data

Data should be split at the level of the independent experimental unit.

For Ames housing data, one row represents one property. It is reasonable to treat each property as an independent unit.

Many datasets are not like this.

Examples:

repeated measurements from the same patient
multiple visits from the same subject
multiple technical replicates from the same batch
multiple samples from the same tree, tissue, donor, or specimen
many single cells from the same biological sample

In these cases:

row != independent unit

If rows are split randomly, data from the same unit can appear in both training and test sets. This can leak unit-specific information and make performance look better than it will be on truly new units.

The split should happen at the independent unit level:

split by patient, donor, batch, specimen, or sample when that is the true unit

For biomedical modeling, this is often more important than the exact train/test proportion.
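
A unit-level split can be sketched with rsample's group_initial_split(). The visits data frame, patient_id column, and sizes below are hypothetical.

```r
library(rsample)

# Hypothetical repeated-measures data: 40 patients, 5 visits each.
set.seed(5)
visits <- data.frame(
  patient_id = rep(sprintf("P%02d", 1:40), each = 5),
  visit      = rep(1:5, times = 40),
  biomarker  = rnorm(200)
)

# Split by patient rather than by row, so all visits from a patient
# stay together on one side of the split.
patient_split <- group_initial_split(visits, group = patient_id, prop = 0.80)
visits_train  <- training(patient_split)
visits_test   <- testing(patient_split)

# Count patients that appear in both sets.
shared_patients <- intersect(visits_train$patient_id, visits_test$patient_id)
length(shared_patients)
```

Here prop applies to groups, so the row-level proportion can differ from 80/20 when units contribute unequal numbers of rows.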

Information Leakage

Information leakage occurs when data outside the training set influence the modeling process.

The obvious form is using test outcomes during training. But leakage can be subtler.

Examples:

preprocessing parameters are estimated from the full dataset before splitting
feature selection uses all rows, including test rows
test set performance is checked repeatedly while revising the model
test predictors influence which training rows or features are selected
samples from the same patient appear in both train and test sets

Keeping ames_train and ames_test as separate data frames is a useful practical guard, but it is not enough by itself. The main discipline is to keep all model-building decisions inside the training data.
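
The first leakage example, preprocessing parameters estimated before splitting, can be illustrated in base R. The data and the manual row-index split are assumptions made to keep the sketch dependency-free; in a tidymodels workflow the same principle is usually handled by estimating a recipe on the training set only.

```r
# Hypothetical data: one numeric predictor, split 80/20 by row index.
set.seed(6)
toy <- data.frame(x = rnorm(500, mean = 50, sd = 10))
train_rows <- sample(nrow(toy), size = 400)
toy_train <- toy[train_rows, , drop = FALSE]
toy_test  <- toy[-train_rows, , drop = FALSE]

# Leak-free: centering and scaling parameters come from training rows only.
x_mean <- mean(toy_train$x)
x_sd   <- sd(toy_train$x)

# The same training-derived parameters are then applied to both sets.
toy_train$x_scaled <- (toy_train$x - x_mean) / x_sd
toy_test$x_scaled  <- (toy_test$x  - x_mean) / x_sd

# Leaky alternative to avoid: mean(toy$x) and sd(toy$x) would let test rows
# influence the preprocessing.
c(train_mean = mean(toy_train$x_scaled), test_mean = mean(toy_test$x_scaled))
```

The scaled test mean is generally not exactly zero, and that is expected: the test set is being treated like new data.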

Training Data Can Be Altered, Test Data Should Represent Reality

Sometimes the training set is deliberately modified.

For example, in imbalanced classification, it is common to rebalance the training data by upsampling the minority class, downsampling the majority class, or a combination of the two.

That can be valid because it is part of the model training strategy.

The test set has a different job. It should resemble the new data the model will face in real use.

training data can be engineered for learning
test data should represent deployment reality

If the test set is artificially balanced, the performance estimate may no longer describe real-world performance.
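
A minimal base R sketch of this asymmetry follows. The data, the 10% prevalence, and the simple downsampling rule are hypothetical, and the manual sampling stands in for whatever rebalancing method is actually used.

```r
# Hypothetical imbalanced data: about 10% positives.
set.seed(7)
toy <- data.frame(
  x     = rnorm(2000),
  class = factor(sample(c("neg", "pos"), 2000, replace = TRUE, prob = c(0.9, 0.1)))
)

# Plain 80/20 split by row index.
train_rows <- sample(nrow(toy), size = 1600)
toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Downsample the majority class in the training set only.
pos_rows <- which(toy_train$class == "pos")
neg_rows <- which(toy_train$class == "neg")
keep_neg <- sample(neg_rows, size = length(pos_rows))
toy_train_balanced <- toy_train[c(pos_rows, keep_neg), ]

# Training is now balanced; the test set keeps the original class mix.
prop.table(table(toy_train_balanced$class))
prop.table(table(toy_test$class))
```

The test set is left untouched on purpose, so metrics computed on it reflect the class mix the model would meet in use.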

Final Model Refit

The train/test split is used to develop, choose, and evaluate a reliable model.

After the final model has been chosen and its expected performance has been assessed on the test set, practitioners often refit the chosen model on all available data for production.

This is not the same as using all data to estimate performance.

The sequence is:

split the data
build and tune using training data
evaluate the chosen model on the test data, without revising it afterward
if the result is acceptable, refit the chosen model on all available data for production

The performance estimate still comes from the held-out test set.
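
The sequence can be sketched with a plain linear model. The simulated data and the use of lm() are assumptions for illustration; the same order of steps applies to any model.

```r
library(rsample)

# Hypothetical regression data with a known linear signal.
set.seed(8)
toy <- data.frame(x = runif(300))
toy$y <- 2 + 3 * toy$x + rnorm(300, sd = 0.5)

toy_split <- initial_split(toy, prop = 0.80)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# 1. Build the model on training data only.
fit_train <- lm(y ~ x, data = toy_train)

# 2. Estimate performance on the held-out test set.
test_pred <- predict(fit_train, newdata = toy_test)
test_rmse <- sqrt(mean((toy_test$y - test_pred)^2))

# 3. If acceptable, refit the same model specification on all rows.
#    The reported performance remains test_rmse from step 2.
fit_final <- lm(y ~ x, data = toy)

test_rmse
```

Note that fit_final is never scored here: no rows remain that it has not seen, so any metric computed for it on these data would be optimistic.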

Carry Forward

For the Ames workflow, the chapter carries forward this code:

library(tidymodels)
data(ames)

ames <- ames %>%
  mutate(Sale_Price = log10(Sale_Price))

set.seed(502)

ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

The key choices are:

Sale_Price is already log10-transformed
the split is 80/20
the split is stratified by Sale_Price
model development should use ames_train
final performance assessment should use ames_test

Checklist

Before splitting data, ask:

What is the independent experimental unit?
Is a row the true independent unit?
Is the task classification, regression, or time-dependent prediction?
Is the outcome imbalanced or skewed enough to stratify?
Should the test set be the most recent data instead of a random sample?
How much data is needed for stable training?
How much data is needed for stable final performance estimation?
Will tuning use a validation set or resampling inside the training set?
What preprocessing must be learned only from training data?
How will I keep the test set quarantined from model development?

Note

For my own modeling work, the main lesson is that splitting data is not a mechanical 80/20 habit. It is a design decision about evidence. The training set is where the model is built. The test set is where the chosen model is judged. If the test set helps build the model, it stops being a fair judge.

Sources

Kuhn & Silge, Tidy Modeling with R, Chapter 5: Spending our Data