Base R Modeling Fundamentals

tidymodels
Published

May 27, 2026

Purpose

This page records the base R modeling conventions that are worth knowing before using tidymodels.

The point is not to become a base R modeling purist. The point is to understand what R is doing when a paper, script, or package writes a model like:

fit <- lm(y ~ x1 + x2 + group, data = dat)

This matters because many statistical models in R still use formula syntax, including linear models, generalized linear models, Cox models, mixed models, and many package-specific modeling functions.

Formula Basics

In R, a model formula is symbolic.

y ~ x1

Read it as:

outcome ~ predictors

The left side is the outcome y. The right side is the predictor set X.

For example:

rate ~ temp

means:

  • outcome: rate
  • predictor: temp

This is a supervised modeling setup because there is a known outcome.
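As a small illustration of what "symbolic" means, the sketch below builds a formula without any data present. The names rate and temp are only symbols here; nothing is evaluated until a modeling function receives the formula together with a data frame.

```r
# A formula is an unevaluated object: rate and temp do not need to exist yet.
f <- rate ~ temp

class(f)     # "formula"
all.vars(f)  # "rate" "temp"

# The two sides can be pulled apart as symbols.
lhs <- f[[2]]  # rate (the outcome)
rhs <- f[[3]]  # temp (the predictor)
```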

Main Effects

Multiple predictors are added with +:

y ~ x1 + x2 + group

This does not mean R mathematically adds x1, x2, and group before modeling. It means the model includes them as separate main effects.

In a biomedical paper, this is the same idea as saying a model was adjusted for age, sex, ancestry, smoking, or other covariates.

Factors and Dummy Variables

Most model engines need numeric inputs. If a predictor is a factor, R's formula machinery usually creates indicator variables automatically.

Example:

y ~ age + sex

If sex is a factor with two levels, R encodes it as a single 0/1 indicator. If an unordered factor has five levels, R's default treatment contrasts create four dummy variables and leave the first level as the reference.

What to remember:

  • the reference level affects coefficient interpretation
  • categorical variables are not passed to the model as text
  • the same encoding must be applied when predicting new samples

This is why sample metadata levels matter. A silently changed reference level can change coefficient signs and interpretation.
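One way to see the encoding directly is to ask for the design matrix with model.matrix(). The sketch below uses a made-up data frame (the column names and values are hypothetical) and assumes R's default treatment contrasts.

```r
# Hypothetical toy data; values are made up for illustration.
dat <- data.frame(
  age = c(30, 45, 52, 61, 38, 47),
  sex = factor(c("F", "M", "F", "M", "M", "F")),
  site = factor(c("a", "b", "c", "d", "e", "a"))
)

# Design matrix for the right-hand side ~ age + sex + site.
X <- model.matrix(~ age + sex + site, data = dat)
colnames(X)
# "(Intercept)" "age" "sexM" "siteb" "sitec" "sited" "sitee"

# Changing the reference level changes which indicator column is created.
dat$sex <- relevel(dat$sex, ref = "M")
X_releveled <- model.matrix(~ age + sex, data = dat)
colnames(X_releveled)
# "(Intercept)" "age" "sexF"
```

The five-level site factor yields four indicator columns, and after relevel() the sex coefficient would describe F relative to M instead of M relative to F.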

Interaction Terms

An interaction asks whether the effect of one predictor changes across another predictor.

Explicit interaction:

y ~ x1 + group + x1:group

Shortcut:

y ~ x1 * group

This expands to:

y ~ x1 + group + x1:group

Reading check:

  • x1 + group means adjusted main effects.
  • x1:group means interaction only.
  • x1 * group means both main effects and interaction.

Do not read age * sex as just “adjusted for age and sex”; it includes age, sex, and age:sex.
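The expansion can be checked without fitting anything. This sketch uses a hypothetical toy data frame and compares the term labels of the two forms, then looks at the interaction column in the design matrix.

```r
# Hypothetical toy data; values are made up for illustration.
dat <- data.frame(
  x1 = c(0.5, 1.2, 2.0, 2.7, 3.1, 4.4),
  group = factor(c("A", "A", "A", "B", "B", "B"))
)

# The * shortcut and the explicit form expand to the same terms.
terms_short <- attr(terms(~ x1 * group), "term.labels")
terms_long <- attr(terms(~ x1 + group + x1:group), "term.labels")
identical(terms_short, terms_long)  # TRUE

# The interaction column is x1 multiplied by the groupB indicator.
X <- model.matrix(~ x1 * group, data = dat)
colnames(X)
# "(Intercept)" "x1" "groupB" "x1:groupB"
```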

Inline Transformations

Formula syntax can also create transformed terms:

y ~ log(x)
y ~ poly(age, 3)
y ~ I((temp * 9 / 5) + 32)

Inside a formula, operators such as +, *, and ^ have special meanings. Wrapping an expression in I() tells R to evaluate it as ordinary arithmetic rather than as formula syntax.

This is useful, but it also shows why a formula is more than a variable list. It can do part of the feature engineering.
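The difference I() makes is easiest to see with +. In this sketch (toy data, hypothetical column names a and b), the same expression produces two predictors without I() and one summed predictor with it.

```r
# Hypothetical toy data; values are made up for illustration.
dat <- data.frame(
  y = c(1.1, 2.3, 2.9, 4.2, 5.0),
  a = c(1, 2, 3, 4, 5),
  b = c(2, 1, 4, 3, 6)
)

# Without I(): + is formula syntax, so a and b are two separate predictors.
X_sep <- model.matrix(y ~ a + b, data = dat)
colnames(X_sep)
# "(Intercept)" "a" "b"

# With I(): + is arithmetic, so the model gets one predictor, the sum a + b.
X_sum <- model.matrix(y ~ I(a + b), data = dat)
colnames(X_sum)
# "(Intercept)" "I(a + b)"
```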

What a Formula Does

A formula usually does three jobs:

  • defines which columns enter the model
  • defines roles: outcome versus predictors
  • encodes or creates model-ready columns

That third part is easy to forget. Formula syntax can create dummy variables, interaction columns, polynomial terms, and transformed predictors.

When preprocessing becomes more complex, tidymodels uses recipes to make these operations more explicit and easier to apply consistently across training and testing data.

Traditional Base R Workflow

A common base R modeling workflow looks like this:

fit <- lm(y ~ x1 + x2 + group, data = dat)

Print a brief overview of the fit:

fit

Check diagnostics:

plot(fit)

Inspect coefficients and inferential statistics:

summary(fit)

Compare nested models:

fit_small <- lm(y ~ x1 + x2, data = dat)
fit_big <- lm(y ~ x1 + x2 + group, data = dat)

anova(fit_small, fit_big)

Predict new samples:

new_dat <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c(4, 5, 6),
  group = factor(c("A", "A", "B"))
)

predict(fit_big, newdata = new_dat)
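The snippets above assume that a data frame named dat already exists. To run the workflow end to end, one option is to simulate it. The sketch below does that; the sample size, seed, and coefficients are arbitrary choices made for illustration, not values from any real study.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

n <- 60
dat <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  group = factor(sample(c("A", "B"), n, replace = TRUE))
)
# Simulated outcome with made-up coefficients.
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + 1.5 * (dat$group == "B") + rnorm(n)

fit_small <- lm(y ~ x1 + x2, data = dat)
fit_big <- lm(y ~ x1 + x2 + group, data = dat)

coef(summary(fit_big))     # estimates, standard errors, t values, p values
anova(fit_small, fit_big)  # F test for adding group

# New samples: reuse the training levels so the factor encoding matches.
new_dat <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c(4, 5, 6),
  group = factor(c("A", "A", "B"), levels = levels(dat$group))
)
preds <- predict(fit_big, newdata = new_dat)
```

Setting levels = levels(dat$group) on the new data is a defensive habit: it keeps the factor definition identical to the one used in training.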

Prediction Requires Matching Columns

When using predict(), the new data should have the same variables and compatible factor levels as the training data.

Important:

  • pass the original factor column, not manually created dummy columns, unless the model explicitly requires a matrix
  • check factor levels before prediction
  • be careful when new data contain a category absent from training data

The same practical issue arises in biomedical models: a trained model only knows the feature space it saw during training.
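A minimal sketch of the level check, using made-up training data. It relies on the xlevels component that lm() stores on the fitted object, and it shows that predict() stops with an error when new data contain a level the model never saw.

```r
# Hypothetical training data; values are made up for illustration.
train <- data.frame(
  y = c(1.0, 1.4, 2.1, 2.9, 3.2, 4.1),
  x1 = c(1, 2, 3, 4, 5, 6),
  group = factor(c("A", "B", "A", "B", "A", "B"))
)
fit <- lm(y ~ x1 + group, data = train)

new_ok <- data.frame(x1 = c(2.5, 3.5), group = factor(c("A", "B")))
new_bad <- data.frame(x1 = 2.5, group = factor("C"))  # "C" was never in training

# Factor levels the fitted model knows about.
fit$xlevels$group
# "A" "B"

# Check new data against the training levels before predicting.
unseen <- setdiff(unique(as.character(new_bad$group)), fit$xlevels$group)
unseen
# "C"

preds_ok <- predict(fit, newdata = new_ok)

# predict() raises an error for the unseen level; capture it instead of stopping.
res <- tryCatch(predict(fit, newdata = new_bad), error = function(e) "error")
```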

Missing Data

Missing data handling in base R can differ across model functions.

Common policies:

  • na.fail(): error if missing values exist
  • na.omit(): drop rows with missing values
  • na.pass(): pass missing values through
  • na.exclude(): omit for fitting but preserve residual/prediction alignment

The dangerous one is silent row dropping. With na.omit, which is the usual default in R, fitted values and residuals have fewer rows than the input data, so merging them back to samples by position can misalign results.
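The row-alignment problem can be reproduced on a small made-up data set. This sketch fits the same model with na.omit and with na.exclude and compares the length of the fitted values.

```r
# Hypothetical toy data with one missing outcome and one missing predictor.
dat <- data.frame(
  id = paste0("s", 1:6),
  y = c(1.2, 2.1, NA, 4.3, 5.1, 5.8),
  x1 = c(1, 2, 3, NA, 5, 6)
)

fit_omit <- lm(y ~ x1, data = dat, na.action = na.omit)
fit_excl <- lm(y ~ x1, data = dat, na.action = na.exclude)

length(fitted(fit_omit))  # 4: rows 3 and 4 were dropped
length(fitted(fit_excl))  # 6: dropped rows come back as NA

# With na.exclude, fitted values line up with the original rows.
dat$fitted <- fitted(fit_excl)
dropped_ids <- dat$id[is.na(dat$fitted)]
dropped_ids
# "s3" "s4"
```

Both fits estimate the same coefficients; only the bookkeeping of the dropped rows differs.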

Reading check:

  • Did the paper or code say how missing values were handled?
  • Were rows dropped before modeling?
  • Were dropped samples reported?
  • Were imputation or normalization steps fit only on training data?
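The last question can be made concrete with a small sketch of train-only preprocessing. The data, split sizes, and seed are arbitrary assumptions; the point is only that the centering and scaling parameters are estimated from the training rows and then reused on the test rows.

```r
set.seed(42)  # arbitrary seed for a reproducible split

# Hypothetical single-predictor data.
dat <- data.frame(x1 = rnorm(20, mean = 10, sd = 3))
train_idx <- sample(seq_len(nrow(dat)), size = 15)
train <- dat[train_idx, , drop = FALSE]
test <- dat[-train_idx, , drop = FALSE]

# Estimate centering and scaling parameters from the training rows only.
x1_mean <- mean(train$x1)
x1_sd <- sd(train$x1)

# Apply the same training-derived parameters to both sets.
train$x1_scaled <- (train$x1 - x1_mean) / x1_sd
test$x1_scaled <- (test$x1 - x1_mean) / x1_sd
```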

Why Tidymodels Helps

Base R modeling is powerful, but the interface is not fully consistent across packages.

For example, model functions from different packages use different type values in predict() to return class probabilities (fit stands for a different kind of model object on each line):

predict(fit, type = "response")
predict(fit, type = "prob")
predict(fit, type = "posterior")
predict(fit, type = "probability")
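As one concrete case, here is a sketch of the glm() convention using the built-in mtcars data. For a binomial glm, type = "response" returns probabilities, while the default type = "link" returns log-odds.

```r
# Built-in mtcars data; am (0/1 transmission type) serves as a binary outcome.
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)

p_link <- predict(fit_glm, type = "link")      # log-odds (the default)
p_resp <- predict(fit_glm, type = "response")  # probabilities between 0 and 1

# The two scales are related by the inverse logit.
same_scale <- all.equal(plogis(p_link), p_resp)
```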

tidymodels tries to make modeling more consistent by separating responsibilities:

  • rsample: data splitting and resampling
  • recipes: preprocessing and feature engineering
  • parsnip: unified model specification
  • workflows: combine preprocessing and model
  • tune: hyperparameter tuning
  • yardstick: performance metrics
  • broom: tidy model outputs

The point is not that base R is wrong. The goal is to make repeated modeling workflows easier to read, compare, and reproduce.

Reading Checks

When reading an R modeling script or method section, ask:

  1. What is on the left side of ~? That is y.
  2. What is on the right side? That is the predictor set X.
  3. Are categorical predictors factors? What is the reference level?
  4. Are there interaction terms such as x1:x2 or x1 * x2?
  5. Are transformations or polynomial terms created inside the formula?
  6. How were missing values handled?
  7. Is the model being used for inference, prediction, or both?
  8. If predictions are reported, were they evaluated on unseen data?

Note

For my own reading, the most useful rule is: formula syntax is not just notation. It defines roles and also performs model encoding. When a paper says it fit y ~ x1 + x2 + covariates, I should translate that into a concrete matrix: one outcome column, multiple predictor columns, dummy variables for categorical covariates, and possibly interaction or transformed columns.

This also keeps the X and y question clear. In a supervised model, y is the outcome being learned or modeled; X is the feature/covariate matrix after all encoding and preprocessing.
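That translation can be done literally in base R. The sketch below (toy data, hypothetical column names) uses model.frame(), model.response(), and model.matrix() to pull out y and X for a formula with a transformation and a factor.

```r
# Hypothetical toy data; values are made up for illustration.
dat <- data.frame(
  y = c(2.1, 3.4, 1.9, 4.8, 3.3, 5.0),
  x1 = c(1, 2, 1, 3, 2, 3),
  x2 = c(0.2, 0.5, 0.1, 0.9, 0.4, 0.8),
  sex = factor(c("F", "M", "F", "M", "F", "M"))
)

# model.frame() evaluates the formula's variables, including log(x2).
mf <- model.frame(y ~ x1 + log(x2) + sex, data = dat)

y <- model.response(mf)                          # outcome vector
X <- model.matrix(attr(mf, "terms"), data = mf)  # encoded predictor matrix

colnames(X)
# "(Intercept)" "x1" "log(x2)" "sexM"
dim(X)
# 6 4
```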

Sources

  • Kuhn & Silge, Tidy Modeling with R, Chapter 3: A Review of R Modeling Fundamentals