Exploratory Data Analysis

tidymodels

Published: May 28, 2026

Purpose

Exploratory data analysis (EDA) is the step before modeling where I learn what the data actually look like.

The goal is not to make attractive plots. The goal is to find modeling-relevant facts before I split, preprocess, fit, tune, or interpret a model.

For a supervised modeling problem, EDA should clarify:

what the outcome is
what the predictors are
whether the outcome needs transformation
whether predictors have odd distributions, missingness, outliers, or leakage risk
whether categorical variables and spatial labels are reliable enough to model directly
what assumptions or preprocessing steps must be carried into the workflow

Ames Example

In Tidy Modeling with R, Chapter 4 introduces the Ames housing data. The modeling goal is to predict house sale price from information about each property.

In modeling notation:

y = Sale_Price
X = house characteristics + location + lot information + condition and quality variables

The important point is that this chapter does not fit a model yet. It first asks whether the data are ready to be modeled.

Know the Data Source

The book uses the transformed Ames data from modeldata, not the raw data directly.

library(modeldata)
data(ames)

This matters because the data have already been made more modeling-ready:

some implicit missing values were recoded as explicit “no feature” levels
categorical predictors were converted to factors
some quality descriptors were removed because they behave more like outcomes than predictors
longitude and latitude were added for each property

The first EDA question is therefore not only “what is in the data?” but also “what has already been done to the data?”
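A quick structural check makes that cleaning history visible. This is a minimal sketch, assuming the modeldata package is installed; the particular summaries are my own choice, not something the book prescribes.

```r
library(modeldata)
data(ames)

# Size of the data: one row per sold property.
dim(ames)

# How many columns are stored as each class (factor, integer, numeric, ...)?
table(vapply(ames, function(col) class(col)[1], character(1)))

# Check that the added location columns are present.
c("Longitude", "Latitude") %in% names(ames)

# Count of explicit NA values across the whole data frame.
# A low count is consistent with absent features having been recoded as levels.
total_na <- sum(is.na(ames))
total_na
```

If the factor count is high and the NA count is low, that is a hint the data were curated before I received them, and I should read the dataset documentation before assuming anything about the raw source.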

Missing Does Not Always Mean Unknown

In housing data, a missing value may mean different things.

For example, a missing alley value in the raw data may mean the house has no alley access, not that the alley status is unknown. In the transformed Ames data, this is encoded explicitly as a category such as No_Alley_Access.

This distinction matters:

unknown value != absent feature

If “no garage”, “no pool”, or “no alley” is treated as generic missingness, the model can lose useful signal or drop rows unnecessarily.
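The recoding idea can be sketched in base R. The vector `alley_raw` below is hypothetical toy data, not the real raw Ames column; only the level name No_Alley_Access comes from the transformed data.

```r
# Hypothetical raw alley column: NA here means "no alley", not "unknown".
alley_raw <- c("Gravel", NA, "Paved", NA, NA)

# Recode the implicit missing values as an explicit level.
alley_clean <- ifelse(is.na(alley_raw), "No_Alley_Access", alley_raw)
alley_clean <- factor(alley_clean,
                      levels = c("Gravel", "No_Alley_Access", "Paved"))

# The absent feature is now an ordinary category a model can use.
table(alley_clean)

# Rows that a complete-case filter would keep, before and after recoding.
kept_before <- sum(!is.na(alley_raw))
kept_after  <- sum(!is.na(alley_clean))
c(before = kept_before, after = kept_after)
```

In this toy example, a complete-case filter would keep two of five rows before recoding and all five after it.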

Start With the Outcome

EDA should usually start with the outcome. In the Ames data, the outcome is Sale_Price.

library(ggplot2)

ggplot(ames, aes(x = Sale_Price)) +
  geom_histogram(bins = 50, col = "white")

The sale price distribution is right-skewed:

many houses are relatively inexpensive
fewer houses are very expensive
the right tail is long

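This affects modeling because errors on the raw dollar scale tend to be largest for the most expensive houses, so a small number of them can dominate a squared-error metric such as RMSE. A model fit to raw prices, such as an ordinary linear regression, can also produce impossible predictions such as negative sale prices.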
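The skew can also be checked numerically rather than only by eye. A small sketch, assuming modeldata is installed and that Sale_Price is still in raw dollars (the data are reloaded here so the block stands alone):

```r
library(modeldata)
data(ames)

price <- ames$Sale_Price  # raw dollars at this point

# For a right-skewed variable the mean sits above the median.
c(mean = mean(price), median = median(price))

# Upper quantiles stretch further from the median than lower ones.
quantile(price, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))
```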

Outcome Transformation

The book uses a log transformation for sale price:

library(dplyr)

ames <- ames %>%
  mutate(Sale_Price = log10(Sale_Price))

From this point forward, Sale_Price means log10 sale price, not raw dollar price.

Why this helps:

the outcome distribution becomes less skewed
expensive houses have less excessive influence on the model
predictions back-transformed with 10^x are always positive, so the model cannot predict a negative sale price
residual variance may be more stable
some statistical model assumptions may become more reasonable
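To see how much the transformation reduces skew, I can compare a skewness statistic on both scales. The `skewness()` helper below is a hypothetical moment-based function written for this note; it is not from the book or from any package.

```r
library(modeldata)
data(ames)

# Moment-based sample skewness (hypothetical helper for illustration).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(ames$Sale_Price)         # raw dollars
skew_log <- skewness(log10(ames$Sale_Price))  # log10 scale

# A value near zero indicates a roughly symmetric distribution.
c(raw = skew_raw, log10 = skew_log)
```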

The tradeoff is interpretation. A model error on the log scale is harder to understand than an error in dollars.

For example:

RMSE = 20000

on the dollar scale is intuitive. But:

RMSE = 0.15

on the log10 scale needs translation before it feels meaningful.
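One way to translate it: an error of 0.15 on the log10 scale is a multiplicative factor of 10^0.15 on the dollar scale, roughly 41% above or 29% below the true price. The sketch below works through that arithmetic; the $200,000 house is a made-up example.

```r
rmse_log10 <- 0.15

# One RMSE-sized error on the log10 scale is a multiplicative factor in dollars.
factor_up   <- 10^rmse_log10     # about 1.41
factor_down <- 10^(-rmse_log10)  # about 0.71

# Applied to a hypothetical house whose true price is $200,000.
true_price <- 200000
c(low = true_price * factor_down, high = true_price * factor_up)
```

This is only an interpretation aid. RMSE is an average over many houses, so an individual prediction can be off by more or less than this factor.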

Inspect Predictors

After the outcome, EDA should inspect predictors.

Questions to ask:

Are individual predictors strongly skewed?
Are there extreme outliers?
Are there impossible or suspicious values?
Are some predictors almost constant?
Are there missing values, and do they mean unknown or absent?
Are categorical variables stored as factors?
Are factor levels rare, misspelled, or inconsistent?

These checks affect later preprocessing decisions in recipes, such as imputation, transformation, dummy encoding, level pooling, and normalization.
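Several of these questions can be answered with a one-row-per-column profile. The sketch below uses a hypothetical toy data frame so the problems are easy to see; the column names and values are invented for illustration.

```r
# Hypothetical toy data standing in for a set of predictors.
toy <- data.frame(
  area     = c(900, 1200, 1500, 1100, 9000),
  has_pool = c(0, 0, 0, 0, 0),
  garage   = c("Attached", NA, "attached", "Detached", "Attached"),
  stringsAsFactors = FALSE
)

# One row of diagnostics per column.
profile <- data.frame(
  column    = names(toy),
  class     = vapply(toy, function(col) class(col)[1], character(1)),
  n_missing = vapply(toy, function(col) sum(is.na(col)), integer(1)),
  n_unique  = vapply(toy, function(col) length(unique(col[!is.na(col)])),
                     integer(1)),
  row.names = NULL
)
profile

# Constant columns carry no information for a model.
constant_cols <- profile$column[profile$n_unique <= 1]
constant_cols

# Level counts expose inconsistent spellings such as "attached" vs "Attached".
table(toy$garage, useNA = "ifany")
```

The same profile can be run on a real data frame by swapping `toy` for it. Outliers such as the 9000 value in `area` still need a plot or a quantile summary to spot.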

Check Redundancy

EDA should also ask whether predictors are redundant.

In housing data, several variables may describe house size:

lot area
first floor area
second floor area
basement area
total living area
garage area

These variables may be highly correlated. That does not automatically mean they are wrong, but it affects how models behave.

For linear models, strong correlation between predictors can make coefficients unstable and harder to interpret. For prediction-focused models, correlation may be less damaging, but it still affects feature importance and interpretation.
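A correlation matrix over the size-related columns is a simple first look at redundancy. This sketch assumes the modeldata column names listed in `size_vars`; the 0.7 cutoff is an arbitrary threshold I chose for illustration, not a rule from the book.

```r
library(modeldata)
data(ames)

size_vars <- c("Lot_Area", "First_Flr_SF", "Second_Flr_SF",
               "Total_Bsmt_SF", "Gr_Liv_Area", "Garage_Area")

# Pairwise Pearson correlations among the size-related predictors.
size_cor <- cor(ames[, size_vars], use = "pairwise.complete.obs")
round(size_cor, 2)

# List pairs whose absolute correlation exceeds the illustrative 0.7 cutoff.
high <- which(abs(size_cor) > 0.7 & upper.tri(size_cor), arr.ind = TRUE)
data.frame(
  var_1 = rownames(size_cor)[high[, "row"]],
  var_2 = colnames(size_cor)[high[, "col"]],
  r     = size_cor[high]
)
```

Pearson correlation only captures linear association, so a low value does not rule out redundancy of other kinds.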

Check Predictor-Outcome Relationships

EDA should examine how predictors relate to the outcome.

For Ames, useful questions include:

Do larger houses sell for more?
Do newer houses sell for more?
Are some neighborhoods consistently more expensive?
Does overall quality track sale price?
Are there nonlinear relationships?
Are there groups with unusual price patterns?

This is not formal inference yet. The goal is to learn whether the relationships are plausible, nonlinear, noisy, or suspicious.
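Two of these questions can be sketched quickly: a scatterplot with a smoother for house size, and a grouped median for neighborhoods. This assumes modeldata and ggplot2 are installed. The data are reloaded so Sale_Price is in raw dollars, and the loess smoother is my own choice for revealing curvature.

```r
library(modeldata)
library(ggplot2)
data(ames)

# Living area vs. sale price on log10 axes, with a flexible smoother
# to show whether the relationship bends.
p_area <- ggplot(ames, aes(x = Gr_Liv_Area, y = Sale_Price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  scale_x_log10() +
  scale_y_log10()
p_area

# Median raw sale price by neighborhood, most expensive first.
# Neighborhood levels with no sales yield NA and are dropped by sort().
nbhd_median <- sort(tapply(ames$Sale_Price, ames$Neighborhood, median),
                    decreasing = TRUE)
head(nbhd_median)
```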

Spatial Structure

The Ames data include both categorical and numeric location information:

Neighborhood
Longitude
Latitude

Chapter 4 shows that neighborhood labels are useful but imperfect.

Examples:

the center of Ames has a gap because that area is the Iowa State University campus, where there are no residential house sales
some neighborhoods are geographically isolated
Meadow Village is surrounded by Mitchell
Northridge and Somerset have some mixed labels near their boundaries
Crawford includes a small isolated group away from the main cluster
the Iowa DOT and Rail Road neighborhood has multiple clusters and a few homes that are outliers in longitude

The modeling lesson is:

geographic labels are summaries, not perfect spatial truth

This matters because a model may benefit from Neighborhood, longitude and latitude, or both.
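A map of the coordinates colored by label, plus a per-neighborhood spread summary, is enough to spot these patterns. The sketch assumes modeldata and ggplot2 are installed; using the coordinate range as a measure of spread is my own rough heuristic.

```r
library(modeldata)
library(ggplot2)
data(ames)

# Map of sales colored by neighborhood label (legend hidden: too many levels).
p_map <- ggplot(ames, aes(x = Longitude, y = Latitude, color = Neighborhood)) +
  geom_point(size = 0.6, alpha = 0.6) +
  theme(legend.position = "none")
p_map

# Range of coordinates within each neighborhood label.
# A wide range for a small neighborhood hints at clusters or outliers.
spread <- aggregate(cbind(Longitude, Latitude) ~ Neighborhood,
                    data = ames,
                    FUN = function(v) diff(range(v)))
head(spread[order(-spread$Longitude), ])
```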

Leakage Check

EDA should look for variables that are too close to the outcome.

In the Ames data used by the book, some quality descriptors were removed because they behave more like outcomes than predictors.

This is a general modeling rule:

Do not let future information, post-sale information, or outcome-like variables leak into X.

Leakage can produce a model that looks accurate during evaluation but fails in real use.
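One crude screen is to flag predictors that track the outcome almost perfectly and then review them by hand. The sketch below uses simulated data; the column `appraisal_after_sale` and the 0.95 cutoff are hypothetical. A high correlation is only a prompt to ask where a variable comes from, not proof of leakage.

```r
set.seed(1)
n <- 200

# Simulated data: `appraisal_after_sale` is derived from the outcome itself.
toy <- data.frame(area = rnorm(n, mean = 1500, sd = 300))
toy$price <- 100 * toy$area + rnorm(n, sd = 30000)
toy$appraisal_after_sale <- toy$price * 1.02 + rnorm(n, sd = 500)

# Correlation of each candidate predictor with the outcome.
predictors <- setdiff(names(toy), "price")
r_with_outcome <- vapply(predictors,
                         function(p) cor(toy[[p]], toy$price),
                         numeric(1))
round(r_with_outcome, 3)

# Flag anything above the illustrative cutoff for manual review.
suspicious <- names(r_with_outcome)[abs(r_with_outcome) > 0.95]
suspicious
```

This only catches numeric, linear leakage. Timing questions, such as whether a value was known before the sale, still require reading the data dictionary.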

EDA Before Splitting

EDA often happens before train/test splitting because I need to understand the whole dataset first.

But there is a boundary:

descriptive inspection is fine
deciding the modeling target is fine
noticing missingness, skewness, factor levels, and spatial structure is fine
estimating preprocessing parameters from the full data is not fine

For example, it is acceptable to decide that Sale_Price should be log-transformed. But imputation values, normalization parameters, PCA loadings, and feature selection should be learned from the training set inside the modeling workflow.
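The boundary can be sketched with rsample and recipes. This is a minimal example, assuming tidymodels is installed; the seed, the 80/20 split, and the two predictors are illustrative choices, not a recommended model.

```r
library(tidymodels)
data(ames)

# Deciding on the log scale is a modeling decision, so it can precede the split.
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))

set.seed(502)  # arbitrary seed for reproducibility
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

rec <- recipe(Sale_Price ~ Gr_Liv_Area + Lot_Area, data = ames_train) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates means and standard deviations from the training rows only.
rec_prepped <- prep(rec, training = ames_train)

# bake() applies those training-set parameters to any data set.
train_baked <- bake(rec_prepped, new_data = NULL)
test_baked  <- bake(rec_prepped, new_data = ames_test)
```

The test rows are centered and scaled with the training-set statistics, so their mean will generally not be exactly zero. That is the intended behavior.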

Carry Forward

For the Ames data, the book carries forward this preparation:

library(tidymodels)
data(ames)
ames <- ames %>%
  mutate(Sale_Price = log10(Sale_Price))

The key point to remember is that every later model in this workflow predicts log10 sale price, not dollars.

EDA Checklist

Before fitting a model, ask:

What is the outcome?
Is this regression, classification, or another task?
What does one row represent?
What has already been cleaned or transformed?
Does missingness mean unknown, absent, or not applicable?
Is the outcome skewed enough to transform?
Are predictor distributions plausible?
Are categorical predictors encoded as factors with sensible levels?
Are there rare levels or mislabeled categories?
Are predictors highly correlated or redundant?
Are any predictors too close to the outcome?
Are there spatial, temporal, batch, or grouping structures?
Which preprocessing decisions must be handled inside the training workflow?
What assumptions must I remember when interpreting metrics?

Note

For my own modeling work, EDA is the step that prevents me from treating a dataset as a clean matrix too early. Before there is a model, there is a question, an outcome, a row definition, a cleaning history, and a set of variables with quirks. The Ames chapter is useful because it shows this in a practical way: look at Sale_Price, decide on the log scale, inspect location, and remember these choices before building workflows.

Sources

Kuhn & Silge, Tidy Modeling with R, Chapter 4: The Ames Housing Data