Base R Modeling Fundamentals

tidymodels

Published

May 27, 2026

Purpose

This page records the base R modeling conventions that are worth knowing before using tidymodels.

The point is not to become a base R modeling purist. The point is to understand what R is doing when a paper, script, or package writes a model like:

fit <- lm(y ~ x1 + x2 + group, data = dat)

This matters because many statistical models in R still use formula syntax, including linear models, generalized linear models, Cox models, mixed models, and many package-specific modeling functions.

Formula Basics

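In R, a model formula is a symbolic object: it records variable names and operators without evaluating them, and a modeling function looks the names up in the data only when the model is fit.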

y ~ x1

Read it as:

outcome ~ predictors

The left side is the outcome y. The right side is the predictor set X.

For example:

rate ~ temp

means:

outcome: rate
predictor: temp

This is a supervised modeling setup because there is a known outcome.
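Because a formula is an ordinary R object, it can be stored and inspected before any data exist. A minimal sketch, reusing the rate ~ temp example (the object name f is only for illustration):

```r
# Neither `rate` nor `temp` needs to exist yet; nothing is evaluated here.
f <- rate ~ temp

class(f)      # "formula"
all.vars(f)   # "rate" "temp"
f[[2]]        # left-hand side (outcome), kept as an unevaluated symbol
f[[3]]        # right-hand side (predictor), also an unevaluated symbol
```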

Main Effects

Multiple predictors are added with +:

y ~ x1 + x2 + group

This does not mean R mathematically adds x1, x2, and group before modeling. It means the model includes them as separate main effects.

In a biomedical paper, this is the same idea as saying a model was adjusted for age, sex, ancestry, smoking, or other covariates.
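To see that + adds terms rather than numbers, one can compare it with an explicit arithmetic sum wrapped in I(). This is a sketch on simulated data; the data frame and coefficient values are invented for illustration:

```r
set.seed(1)
dat <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(20, sd = 0.1)

fit_main <- lm(y ~ x1 + x2, data = dat)     # two separate main effects
fit_sum  <- lm(y ~ I(x1 + x2), data = dat)  # one predictor: the arithmetic sum

names(coef(fit_main))  # "(Intercept)" "x1" "x2"
names(coef(fit_sum))   # "(Intercept)" "I(x1 + x2)"
```

The first model estimates one slope per predictor; the second forces a single shared slope on the sum, which is a different model.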

Factors and Dummy Variables

Most model engines need numeric inputs. If a predictor is a factor (or a character column, which the formula machinery converts to a factor), R usually creates indicator variables automatically.

Example:

y ~ age + sex

If sex is a factor with two levels, R encodes it as a single 0/1 indicator under its default treatment contrasts. A factor with five levels usually becomes four dummy variables, with the remaining level (by default the first) serving as the reference.

What to remember:

the reference level affects coefficient interpretation
categorical variables are not passed to the model as text
the same encoding must be applied when predicting new samples

This is why sample metadata levels matter. A silently changed reference level can change coefficient signs and interpretation.
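The encoding can be inspected directly with model.matrix(). The sketch below uses a small made-up data frame (toy, site, and the values are illustrative) and assumes R's default treatment contrasts:

```r
toy <- data.frame(
  age = c(30, 45, 52, 61),
  sex = factor(c("F", "M", "F", "M"))
)

# The first level ("F") is the reference, so the indicator column is "sexM".
cols_default <- colnames(model.matrix(~ age + sex, data = toy))
cols_default   # "(Intercept)" "age" "sexM"

# Changing the reference level changes which indicator is created.
toy$sex <- relevel(toy$sex, ref = "M")
cols_releveled <- colnames(model.matrix(~ age + sex, data = toy))
cols_releveled # "(Intercept)" "age" "sexF"

# A five-level factor yields four dummy columns plus the intercept.
site <- factor(c("a", "b", "c", "d", "e"))
n_dummies <- ncol(model.matrix(~ site)) - 1
n_dummies      # 4
```

The coefficient for sexM and the coefficient for sexF describe the same contrast with opposite signs, which is how a changed reference level can flip a reported effect.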

Interaction Terms

An interaction asks whether the effect of one predictor changes across another predictor.

Explicit interaction:

y ~ x1 + group + x1:group

Shortcut:

y ~ x1 * group

This expands to:

y ~ x1 + group + x1:group

Reading check:

x1 + group means adjusted main effects.
x1:group means interaction only.
x1 * group means both main effects and interaction.

Do not read age * sex as just “adjusted for age and sex”; it includes age, sex, and age:sex.
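The expansion of * can be checked without fitting anything, by asking terms() which term labels a formula produces. A small sketch (the object names are illustrative):

```r
f_short <- y ~ x1 * group
f_long  <- y ~ x1 + group + x1:group

labels_short <- attr(terms(f_short), "term.labels")
labels_long  <- attr(terms(f_long), "term.labels")

labels_short                         # "x1" "group" "x1:group"
identical(labels_short, labels_long) # TRUE
```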

Inline Transformations

Formula syntax can also create transformed terms:

y ~ log(x)
y ~ poly(age, 3)
y ~ I((temp * 9 / 5) + 32)

The I() function tells R to evaluate the expression inside it as ordinary arithmetic, rather than interpreting operators such as +, *, and ^ as formula syntax.

This is useful, but it also shows why a formula is more than a variable list. It can do part of the feature engineering.
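A short sketch of what these inline transformations produce, using a made-up temperature column (toy and its values are illustrative). It also shows why I() matters: without it, ^ is read as formula syntax.

```r
toy <- data.frame(temp = c(0, 10, 20, 30, 40))

# I() protects the arithmetic: the column holds Fahrenheit values.
X <- model.matrix(~ I((temp * 9 / 5) + 32), data = toy)
fahrenheit <- unname(X[, 2])
fahrenheit   # 32 50 68 86 104

# poly() creates one column per polynomial degree.
poly_cols <- colnames(model.matrix(~ poly(temp, 3), data = toy))
poly_cols    # "(Intercept)" "poly(temp, 3)1" "poly(temp, 3)2" "poly(temp, 3)3"

# Without I(), temp^2 is formula syntax and collapses to plain temp.
attr(terms(~ temp^2), "term.labels")     # "temp"
attr(terms(~ I(temp^2)), "term.labels")  # "I(temp^2)"
```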

What a Formula Does

A formula usually does three jobs:

defines which columns enter the model
defines roles: outcome versus predictors
encodes or creates model-ready columns

That third part is easy to forget. Formula syntax can create dummy variables, interaction columns, polynomial terms, and transformed predictors.

When preprocessing becomes more complex, tidymodels uses recipes to make these operations more explicit and easier to apply consistently across training and testing data.
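The three jobs can be made visible with base R helpers: model.frame() selects and evaluates the columns, model.response() extracts the outcome, and model.matrix() builds the encoded predictor matrix. A minimal sketch on invented data (toy, f, and the values are illustrative):

```r
toy <- data.frame(
  y     = c(1.2, 2.3, 2.9, 4.1, 5.0, 6.2),
  x1    = c(1, 2, 3, 4, 5, 6),
  group = factor(c("A", "B", "C", "A", "B", "C"))
)
f <- y ~ log(x1) + group

mf    <- model.frame(f, data = toy)   # columns that enter the model
y_vec <- model.response(mf)           # role: outcome
X     <- model.matrix(f, data = toy)  # role: encoded predictors

colnames(X)  # "(Intercept)" "log(x1)" "groupB" "groupC"
dim(X)       # 6 4
```

Here the three-level factor became two dummy columns and log(x1) became a transformed numeric column, all from the formula alone.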

Traditional Base R Workflow

A common base R modeling workflow looks like this:

fit <- lm(y ~ x1 + x2 + group, data = dat)

Inspect a short model print:

fit

Check diagnostics:

plot(fit)

Inspect coefficients and inferential statistics:

summary(fit)

Compare nested models:

fit_small <- lm(y ~ x1 + x2, data = dat)
fit_big <- lm(y ~ x1 + x2 + group, data = dat)

anova(fit_small, fit_big)

Predict new samples:

new_dat <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c(4, 5, 6),
  group = factor(c("A", "A", "B"))
)

predict(fit_big, newdata = new_dat)
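The snippets above assume a data frame named dat already exists. To run the workflow end to end, one option is to simulate such a data frame first. This is only a sketch: the sample size, coefficients, and group labels below are invented for illustration.

```r
set.seed(42)
n <- 60
dat <- data.frame(
  x1    = rnorm(n),
  x2    = rnorm(n),
  group = factor(rep(c("A", "B"), times = n / 2))
)
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + 0.8 * (dat$group == "B") + rnorm(n)

fit_small <- lm(y ~ x1 + x2, data = dat)
fit_big   <- lm(y ~ x1 + x2 + group, data = dat)

coef(summary(fit_big))               # estimate, std. error, t value, p value
cmp <- anova(fit_small, fit_big)     # F test for adding `group`
cmp

new_dat <- data.frame(
  x1    = c(1, 2, 3),
  x2    = c(4, 5, 6),
  group = factor(c("A", "A", "B"), levels = levels(dat$group))
)
pred <- predict(fit_big, newdata = new_dat)
length(pred)  # 3, one prediction per new row
```

Setting levels explicitly on the new factor is a defensive habit rather than a requirement in this case; it keeps the new data aligned with the training levels.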

Prediction Requires Matching Columns

When using predict(), the new data should have the same variables and compatible factor levels as the training data.

Important:

pass the original factor column, not manually created dummy columns, unless the model explicitly requires a matrix
check factor levels before prediction
be careful when new data contain a category absent from training data

The same practical issue arises in biomedical models: a trained model only knows the feature space it saw during training.
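A small sketch of the unseen-category problem with lm(), on invented data. The error is caught with tryCatch() so the script keeps running; the level check against fit$xlevels is one possible pre-flight test, not the only one.

```r
train <- data.frame(
  y     = c(1, 2, 3, 4, 5, 6),
  group = factor(c("A", "A", "A", "B", "B", "B"))
)
fit <- lm(y ~ group, data = train)

# "C" never appeared in training.
new_dat <- data.frame(group = factor(c("A", "C")))

# Check levels before predicting: which new levels are unknown to the model?
unseen <- setdiff(levels(new_dat$group), fit$xlevels$group)
unseen  # "C"

# predict() stops with an error about new factor levels.
res <- tryCatch(
  predict(fit, newdata = new_dat),
  error = function(e) conditionMessage(e)
)
res
```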

Missing Data

Missing data handling in base R can differ across model functions.

Common policies:

na.fail(): error if missing values exist
na.omit(): drop rows with missing values
na.pass(): pass missing values through
na.exclude(): omit for fitting but preserve residual/prediction alignment

The dangerous one is silent row dropping, which is what na.omit (the usual default) does. If the fitted values, residuals, or predictions have fewer rows than the input data, merging them back to samples can misalign results.
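The difference between na.omit and na.exclude shows up in the length of the residuals. A minimal sketch on invented data with two incomplete rows:

```r
toy <- data.frame(
  y = c(1.0, 2.1, NA, 4.2, 4.9, 6.1),
  x = c(1, 2, 3, 4, NA, 6)
)

fit_omit <- lm(y ~ x, data = toy, na.action = na.omit)
fit_excl <- lm(y ~ x, data = toy, na.action = na.exclude)

nrow(toy)                     # 6
length(residuals(fit_omit))   # 4: incomplete rows are dropped
length(residuals(fit_excl))   # 6: padded with NA to match the input

# Positions of the padded rows line up with the original data.
na_rows <- unname(which(is.na(residuals(fit_excl))))
na_rows                       # 3 5
```

Both fits use the same four complete rows, so the coefficients agree; only the alignment of residuals and fitted values with the input differs.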

Reading check:

Did the paper or code say how missing values were handled?
Were rows dropped before modeling?
Were dropped samples reported?
Were imputation or normalization steps fit only on training data?

Why Tidymodels Helps

Base R modeling is powerful, but the interface is not fully consistent across packages.

For example, different model functions expect different type arguments in predict() to return class probabilities:

predict(fit, type = "response")
predict(fit, type = "prob")
predict(fit, type = "posterior")
predict(fit, type = "probability")
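As one concrete base R case, predict() on a logistic glm() returns log-odds by default and probabilities only when type = "response" is requested. A sketch on invented data:

```r
toy <- data.frame(
  x = c(-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2, 2.5),
  y = c(0, 0, 1, 0, 0, 1, 0, 1, 1, 1)
)
fit <- glm(y ~ x, data = toy, family = binomial)

p_link <- predict(fit)                     # default: link scale (log-odds)
p_resp <- predict(fit, type = "response")  # probabilities between 0 and 1

# The two scales are related by the inverse logit.
all.equal(plogis(p_link), p_resp)  # TRUE
range(p_resp)
```

Forgetting the type argument here silently yields log-odds, which is one reason a consistent prediction interface is attractive.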

tidymodels tries to make modeling more consistent by separating responsibilities:

rsample: data splitting and resampling
recipes: preprocessing and feature engineering
parsnip: unified model specification
workflows: combine preprocessing and model
tune: hyperparameter tuning
yardstick: performance metrics
broom: tidy model outputs

The point is not that base R is wrong. The point is that a consistent interface makes repeated modeling workflows easier to read, compare, and reproduce.

Reading Checks

When reading an R modeling script or method section, ask:

What is on the left side of ~? That is y.
What is on the right side? That is the predictor set X.
Are categorical predictors factors? What is the reference level?
Are there interaction terms such as x1:x2 or x1 * x2?
Are transformations or polynomial terms created inside the formula?
How were missing values handled?
Is the model being used for inference, prediction, or both?
If predictions are reported, were they evaluated on unseen data?

Note

For my own reading, the most useful rule is: formula syntax is not just notation. It defines roles and also performs model encoding. When a paper says it fit y ~ x1 + x2 + covariates, I should translate that into a concrete matrix: one outcome column, multiple predictor columns, dummy variables for categorical covariates, and possibly interaction or transformed columns.

This also keeps the X and y question clear. In a supervised model, y is the outcome being learned or modeled; X is the feature/covariate matrix after all encoding and preprocessing.

Sources

Kuhn & Silge, Tidy Modeling with R, Chapter 3: A Review of R Modeling Fundamentals