Exploratory Data Analysis
Purpose
Exploratory data analysis (EDA) is the step before modeling where I learn what the data actually look like.
The goal is not to make attractive plots. The goal is to find modeling-relevant facts before I split, preprocess, fit, tune, or interpret a model.
For a supervised modeling problem, EDA should clarify:
- what the outcome is
- what the predictors are
- whether the outcome needs transformation
- whether predictors have odd distributions, missingness, outliers, or leakage risk
- whether categorical variables and spatial labels are reliable enough to model directly
- what assumptions or preprocessing steps must be carried into the workflow
Ames Example
In Tidy Modeling with R, Chapter 4 introduces the Ames housing data. The modeling goal is to predict house sale price from information about each property.
In modeling notation:
y = Sale_Price
X = house characteristics + location + lot information + condition and quality variables
The important point is that this chapter does not fit a model yet. It first asks whether the data are ready to be modeled.
Know the Data Source
The book uses the transformed Ames data from modeldata, not the raw data directly.
library(modeldata)
data(ames)
This matters because the data have already been made more modeling-ready:
- some implicit missing values were recoded as explicit “no feature” levels
- categorical predictors were converted to factors
- some quality descriptors were removed because they behave more like outcomes than predictors
- longitude and latitude were added for each property
The first EDA question is therefore not only “what is in the data?” but also “what has already been done to the data?”
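A quick way to approach the "what has already been done?" question is to inspect the object itself. The following is a minimal sketch, assuming the modeldata package is installed; the names col_types and na_counts are my own and do not come from the book.

```r
library(modeldata)

data(ames)

# Size of the data: one row per property, columns are outcome plus predictors.
dim(ames)

# Tally column types to confirm that categorical predictors arrive as factors.
col_types <- vapply(ames, function(col) class(col)[1], character(1))
table(col_types)

# Count explicit NA values per column and show only columns that have any.
na_counts <- colSums(is.na(ames))
na_counts[na_counts > 0]

# The package help page documents the source and the cleaning applied:
# ?modeldata::ames
```

If the NA tally comes back empty or nearly empty, that is itself a hint that missingness was handled upstream and is worth reading about in the help page.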
Missing Does Not Always Mean Unknown
In housing data, a missing value may mean different things.
For example, a missing alley value in the raw data may mean the house has no alley access, not that the alley status is unknown. In the transformed Ames data, this is encoded explicitly as a category such as No_Alley_Access.
This distinction matters:
unknown value != absent feature
If “no garage”, “no pool”, or “no alley” is treated as generic missingness, the model can lose useful signal or drop rows unnecessarily.
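To make the distinction concrete, here is a small base R sketch of recoding a structural NA into its own level. The vector raw_alley is made-up illustration data, not the actual raw Ames file, and the level name simply mirrors the one used in the transformed data.

```r
# Hypothetical raw column: NA here means "no alley", not "unknown".
raw_alley <- c("Grvl", NA, "Pave", NA, NA)

# Recode the structural NA into an explicit level so that rows are kept
# and the "no alley" signal stays available to the model.
alley <- ifelse(is.na(raw_alley), "No_Alley_Access", raw_alley)
alley <- factor(alley)

# Every row now belongs to a named category.
table(alley)
sum(is.na(alley))
```

This recoding is only appropriate when I know from the data documentation that NA means "absent". A value that is truly unknown should stay missing and be handled by imputation later.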
Start With the Outcome
EDA should usually start with the outcome. In the Ames data, the outcome is Sale_Price.
library(ggplot2)
ggplot(ames, aes(x = Sale_Price)) +
  geom_histogram(bins = 50, col = "white")
The sale price distribution is right-skewed:
- many houses are relatively inexpensive
- fewer houses are very expensive
- the right tail is long
This affects modeling because a small number of expensive houses can dominate squared-error metrics such as RMSE. Models fit on the raw dollar scale can also produce impossible predictions such as negative sale prices.
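The skew can also be checked numerically rather than only by eye. This is a rough sketch, assuming a freshly loaded ames from modeldata where Sale_Price is still in raw dollars; the variable names are my own.

```r
library(modeldata)

data(ames)  # fresh copy: Sale_Price is in raw dollars here

# A mean well above the median is a quick numeric sign of right skew.
price_mean   <- mean(ames$Sale_Price)
price_median <- median(ames$Sale_Price)
c(mean = price_mean, median = price_median)

# Upper quantiles show how far the right tail stretches.
quantile(ames$Sale_Price, probs = c(0.5, 0.9, 0.99, 1))
```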
Outcome Transformation
The book uses a log transformation for sale price:
library(dplyr)
ames <- ames %>%
  mutate(Sale_Price = log10(Sale_Price))
From this point forward, Sale_Price means log10 sale price, not raw dollar price.
Why this helps:
- the outcome distribution becomes less skewed
- expensive houses have less excessive influence on the model
- back-transformed predictions are always positive, so no house is predicted to sell for a negative price
- residual variance may be more stable
- some statistical model assumptions may become more reasonable
The tradeoff is interpretation. A model error on the log scale is harder to understand than an error in dollars.
For example:
RMSE = 20000
on the dollar scale is intuitive. But:
RMSE = 0.15
on the log10 scale needs translation before it feels meaningful.
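One way to do that translation: an additive error on the log10 scale is a multiplicative error on the dollar scale, and 10^0.15 is about 1.41. The sketch below works through the arithmetic in base R; the $200,000 house is a hypothetical value chosen only for illustration.

```r
rmse_log10 <- 0.15

# An error of +/- 0.15 in log10 units is a multiplicative error in dollars.
factor_over  <- 10^rmse_log10     # prediction too high by this factor
factor_under <- 10^(-rmse_log10)  # prediction too low by this factor

round(c(over = factor_over, under = factor_under), 2)

# Illustration for a hypothetical house that actually sold for $200,000:
# a prediction one RMSE below or above the truth, back on the dollar scale.
true_price <- 200000
round(true_price * c(under = factor_under, over = factor_over))
```

So an RMSE of 0.15 means a typical prediction is off by a factor of roughly 1.4 in either direction, which is a relative error rather than a fixed dollar amount.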
Inspect Predictors
After the outcome, EDA should inspect predictors.
Questions to ask:
- Are individual predictors strongly skewed?
- Are there extreme outliers?
- Are there impossible or suspicious values?
- Are some predictors almost constant?
- Are there missing values, and do they mean unknown or absent?
- Are categorical variables stored as factors?
- Are factor levels rare, misspelled, or inconsistent?
These checks affect later preprocessing decisions in recipes, such as imputation, transformation, dummy encoding, level pooling, and normalization.
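Two of these questions (almost-constant predictors and rare factor levels) are easy to screen for in base R. This is a sketch under my own assumptions: the threshold of 10 rows for a "rare" level is arbitrary, and the helper names top_share and rare_levels are not from the book.

```r
library(modeldata)

data(ames)

# Near-constant predictors: share of rows taken by the most common value.
top_share <- vapply(
  ames,
  function(col) max(table(col)) / length(col),
  numeric(1)
)
head(sort(top_share, decreasing = TRUE), 10)

# Rare factor levels: levels with fewer than 10 rows (arbitrary threshold).
# Unused levels with zero rows are reported as well.
is_fct <- vapply(ames, is.factor, logical(1))
rare_levels <- lapply(ames[is_fct], function(col) {
  counts <- table(col)
  names(counts)[counts < 10]
})
rare_levels[lengths(rare_levels) > 0]
```

Columns near the top of top_share are candidates for a zero-variance or near-zero-variance filter, and the rare levels are candidates for pooling into an "other" category.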
Check Redundancy
EDA should also ask whether predictors are redundant.
In housing data, several variables may describe house size:
- lot area
- first floor area
- second floor area
- basement area
- total living area
- garage area
These variables may be highly correlated. That does not automatically mean they are wrong, but it affects how models behave.
For linear models, strong correlation between predictors can make coefficients unstable and harder to interpret. For prediction-focused models, correlation may be less damaging, but it still affects feature importance and interpretation.
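A correlation matrix of the size-related columns gives a first look at redundancy. The sketch assumes the column names used in the modeldata version of ames; size_vars and size_cor are my own names.

```r
library(modeldata)

data(ames)

# Size-related numeric predictors in the transformed Ames data.
size_vars <- c("Lot_Area", "First_Flr_SF", "Second_Flr_SF",
               "Total_Bsmt_SF", "Gr_Liv_Area", "Garage_Area")

# Pairwise Pearson correlations, rounded for readability.
size_cor <- cor(ames[, size_vars])
round(size_cor, 2)
```

Pairs with high absolute correlation are the ones to keep in mind when reading linear model coefficients or variable importance scores later.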
Check Predictor-Outcome Relationships
EDA should examine how predictors relate to the outcome.
For Ames, useful questions include:
- Do larger houses sell for more?
- Do newer houses sell for more?
- Are some neighborhoods consistently more expensive?
- Does overall quality track sale price?
- Are there nonlinear relationships?
- Are there groups with unusual price patterns?
This is not formal inference yet. The goal is to learn whether the relationships are plausible, nonlinear, noisy, or suspicious.
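A scatterplot with a smoother and a per-group summary cover the first few questions. This is a sketch, assuming a fresh copy of ames with Sale_Price in raw dollars (the log scale is applied on the axes instead); the loess smoother is my choice, not something the chapter prescribes.

```r
library(modeldata)
library(ggplot2)

data(ames)  # fresh copy: Sale_Price is in raw dollars here

# Living area against sale price, both on log10 axes, with a flexible
# smoother to reveal nonlinearity.
p <- ggplot(ames, aes(x = Gr_Liv_Area, y = Sale_Price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  scale_x_log10() +
  scale_y_log10()
p

# Median sale price by neighborhood, most expensive first.
nbhd_median <- sort(
  tapply(ames$Sale_Price, ames$Neighborhood, median),
  decreasing = TRUE
)
head(nbhd_median)
```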
Spatial Structure
The Ames data include both categorical and numeric location information:
- Neighborhood
- longitude
- latitude
Chapter 4 shows that neighborhood labels are useful but imperfect.
Examples:
- the center of Ames has a gap in the data where the Iowa State University campus sits, because no residential sales occur there
- some neighborhoods are geographically isolated
- Meadow Village is surrounded by Mitchell
- Northridge and Somerset have some mixed labels near their boundaries
- Crawford includes a small isolated group away from the main cluster
- the Iowa DOT and Rail Road neighborhood has multiple clusters and longitudinal outliers
The modeling lesson is:
geographic labels are summaries, not perfect spatial truth
This matters because a model may benefit from Neighborhood, longitude and latitude, or both.
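The patterns above can be explored by plotting the coordinates colored by label. This is my own rough sketch, not the book's figure code; it assumes the Longitude and Latitude columns in the modeldata version of ames, and lon_range is a name I made up.

```r
library(modeldata)
library(ggplot2)

data(ames)

# Each point is a property; color shows its neighborhood label.
# The legend is hidden because there are many neighborhoods.
p <- ggplot(ames, aes(x = Longitude, y = Latitude, color = Neighborhood)) +
  geom_point(size = 0.6, alpha = 0.6) +
  theme(legend.position = "none")
p

# East-west spread of each neighborhood: a wide range can point to
# multiple clusters or longitudinal outliers within one label.
lon_range <- tapply(
  ames$Longitude,
  ames$Neighborhood,
  function(x) diff(range(x))
)
head(sort(lon_range, decreasing = TRUE))
```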
Leakage Check
EDA should look for variables that are too close to the outcome.
In the Ames data used by the book, some quality descriptors were removed because they behave more like outcomes than predictors.
This is a general modeling rule:
Do not let future information, post-sale information, or outcome-like variables leak into X.
Leakage can produce a model that looks accurate during evaluation but fails in real use.
EDA Before Splitting
EDA often happens before train/test splitting because I need to understand the whole dataset first.
But there is a boundary:
- descriptive inspection is fine
- deciding the modeling target is fine
- noticing missingness, skewness, factor levels, and spatial structure is fine
- estimating preprocessing parameters from the full data is not fine
For example, it is acceptable to decide that Sale_Price should be log-transformed. But imputation values, normalization parameters, PCA loadings, and feature selection should be learned from the training set inside the modeling workflow.
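The boundary can be sketched with rsample and recipes. The seed, the 80/20 proportion, and the two predictors below are arbitrary choices for illustration, not the book's settings.

```r
library(tidymodels)

data(ames)

# Deciding on the log scale is a fixed choice, so it can be made up front.
ames <- ames %>%
  mutate(Sale_Price = log10(Sale_Price))

set.seed(123)  # arbitrary seed, for reproducibility only
ames_split <- initial_split(ames, prop = 0.80)
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# Normalization is declared here but not yet estimated.
rec <- recipe(Sale_Price ~ Gr_Liv_Area + Lot_Area, data = ames_train) %>%
  step_normalize(all_numeric_predictors())

# prep() estimates the means and standard deviations from training rows only.
rec_trained <- prep(rec, training = ames_train)

# The test set is transformed with the training-set parameters.
test_baked <- bake(rec_trained, new_data = ames_test)
head(test_baked)
```

The point of the pattern is that nothing estimated from the test rows ever reaches the preprocessing step.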
Carry Forward
For the Ames data, the book carries forward this preparation:
library(tidymodels)
data(ames)
ames <- ames %>%
  mutate(Sale_Price = log10(Sale_Price))
The key point to remember is that every later model in this workflow predicts log10 sale price.
EDA Checklist
Before fitting a model, ask:
- What is the outcome?
- Is this regression, classification, or another task?
- What does one row represent?
- What has already been cleaned or transformed?
- Does missingness mean unknown, absent, or not applicable?
- Is the outcome skewed enough to transform?
- Are predictor distributions plausible?
- Are categorical predictors encoded as factors with sensible levels?
- Are there rare levels or mislabeled categories?
- Are predictors highly correlated or redundant?
- Are any predictors too close to the outcome?
- Are there spatial, temporal, batch, or grouping structures?
- Which preprocessing decisions must be handled inside the training workflow?
- What assumptions must I remember when interpreting metrics?
Note
For my own modeling work, EDA is the step that prevents me from treating a dataset as a clean matrix too early. Before there is a model, there is a question, an outcome, a row definition, a cleaning history, and a set of variables with quirks. The Ames chapter is useful because it shows this in a practical way: look at Sale_Price, decide on the log scale, inspect location, and remember these choices before building workflows.
Sources
- Kuhn & Silge, Tidy Modeling with R, Chapter 4: The Ames Housing Data