Base R Modeling Fundamentals
Purpose
This page records the base R modeling conventions that are worth knowing before using tidymodels.
The point is not to become a base R modeling purist. The point is to understand what R is doing when a paper, script, or package writes a model like:
fit <- lm(y ~ x1 + x2 + group, data = dat)
This matters because many statistical models in R still use formula syntax, including linear models, generalized linear models, Cox models, mixed models, and many package-specific modeling functions.
Formula Basics
In R, a model formula is symbolic.
y ~ x1
Read it as:
outcome ~ predictors
The left side is the outcome y. The right side is the predictor set X.
For example:
rate ~ temp
means:
- outcome: rate
- predictor: temp
This is a supervised modeling setup because there is a known outcome.
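As a minimal sketch of what "symbolic" means here, the formula below is created without any data in scope. The object name f is only for illustration; rate and temp are the example variables used above.

```r
# A formula is an unevaluated R object; no data are needed to create it.
f <- rate ~ temp

class(f)      # "formula"
all.vars(f)   # "rate" "temp"

# The two sides can be extracted as unevaluated symbols.
lhs <- f[[2]] # rate (the outcome)
rhs <- f[[3]] # temp (the predictor)
```

The variables are only looked up when a modeling function evaluates the formula against its data argument.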
Main Effects
Multiple predictors are added with +:
y ~ x1 + x2 + group
This does not mean R mathematically adds x1, x2, and group before modeling. It means the model includes them as separate main effects.
In a biomedical paper, this is the same idea as saying a model was adjusted for age, sex, ancestry, smoking, or other covariates.
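The difference between formula + and arithmetic + can be checked on simulated data. This is a sketch only: the data frame dat and its values are made up for illustration, and I() is used here just to force a literal sum for contrast.

```r
set.seed(1)

# Simulated data; column names follow the example formula above.
dat <- data.frame(
  x1 = rnorm(30),
  x2 = rnorm(30),
  group = factor(rep(c("A", "B", "C"), each = 10))
)
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(30)

# Formula "+" adds terms: each predictor gets its own coefficient.
fit <- lm(y ~ x1 + x2 + group, data = dat)
names(coef(fit))
# "(Intercept)" "x1" "x2" "groupB" "groupC"

# Arithmetic "+" inside I() really sums the columns: one coefficient.
fit_sum <- lm(y ~ I(x1 + x2), data = dat)
names(coef(fit_sum))
# "(Intercept)" "I(x1 + x2)"
```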
Factors and Dummy Variables
Most model engines need numeric inputs. If a predictor is a factor, R formula machinery usually creates indicator variables automatically.
Example:
y ~ age + sex
If sex is a factor with two levels, R will encode it as a 0/1 indicator. If a factor has five levels, R usually creates four dummy variables and leaves one level as the reference.
What to remember:
- the reference level affects coefficient interpretation
- categorical variables are not passed to the model as text
- the same encoding must be applied when predicting new samples
This is why sample metadata levels matter. A silently changed reference level can change coefficient signs and interpretation.
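The encoding can be inspected directly with model.matrix(). The sketch below uses a made-up three-level factor called site (not from the text above) and assumes R's default treatment contrasts; relevel() changes which level is the reference.

```r
# Toy data with a three-level factor; names are illustrative only.
d <- data.frame(
  age = c(30, 40, 50, 60, 70),
  site = factor(c("a", "b", "c", "b", "a"))
)

# Default: the first level ("a") is the reference and gets no column.
X <- model.matrix(~ age + site, data = d)
colnames(X)
# "(Intercept)" "age" "siteb" "sitec"

# Changing the reference level changes which dummy columns exist,
# and therefore what each coefficient is compared against.
d$site2 <- relevel(d$site, ref = "c")
X2 <- model.matrix(~ age + site2, data = d)
colnames(X2)
# "(Intercept)" "age" "site2a" "site2b"
```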
Interaction Terms
An interaction asks whether the effect of one predictor changes across another predictor.
Explicit interaction:
y ~ x1 + group + x1:group
Shortcut:
y ~ x1 * group
This expands to:
y ~ x1 + group + x1:group
Reading check:
- x1 + group means adjusted main effects.
- x1:group means interaction only.
- x1 * group means both main effects and interaction.
Do not read age * sex as just “adjusted for age and sex”; it includes age, sex, and age:sex.
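The expansion can be verified without fitting anything, by asking terms() which term labels a formula produces. This is a small check, not a full model; the variable names are placeholders that never need to exist.

```r
# terms() parses a formula symbolically; no data are required.
labels_star     <- attr(terms(y ~ x1 * group), "term.labels")
labels_explicit <- attr(terms(y ~ x1 + group + x1:group), "term.labels")
labels_colon    <- attr(terms(y ~ x1:group), "term.labels")

labels_star      # "x1" "group" "x1:group"
labels_colon     # "x1:group"

# The shortcut and the explicit form describe the same set of terms.
identical(labels_star, labels_explicit)  # TRUE
```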
Inline Transformations
Formula syntax can also create transformed terms:
y ~ log(x)
y ~ poly(age, 3)
y ~ I((temp * 9 / 5) + 32)
The I() function tells R to interpret the expression as arithmetic rather than formula syntax.
This is useful, but it also shows why a formula is more than a variable list: it can do part of the feature engineering.
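A short sketch of why I() matters, using made-up values. Inside a formula, operators such as ^ and + have formula meanings, so x^2 without I() does not create a squared column; the Celsius-to-Fahrenheit term from above is included to show the arithmetic being applied.

```r
# Without I(), "^" is formula syntax (term crossing), so x^2 collapses to x.
labels_plain <- attr(terms(~ x + x^2), "term.labels")
labels_plain   # "x"

# With I(), the squared column becomes its own term.
labels_I <- attr(terms(~ x + I(x^2)), "term.labels")
labels_I       # "x" "I(x^2)"

# The transformation is applied when the design matrix is built.
d <- data.frame(temp = c(0, 10, 20, 30))
X <- model.matrix(~ I((temp * 9 / 5) + 32), data = d)
temp_f <- unname(X[, 2])
temp_f         # 32 50 68 86
```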
What Formula Does
A formula usually does three jobs:
- defines which columns enter the model
- defines roles: outcome versus predictors
- encodes or creates model-ready columns
That third part is easy to forget. Formula syntax can create dummy variables, interaction columns, polynomial terms, and transformed predictors.
When preprocessing becomes more complex, tidymodels uses recipes to make these operations more explicit and easier to apply consistently across training and testing data.
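The three jobs can be made visible with the base R helpers that modeling functions typically call internally. This is an illustrative sketch on a tiny made-up data frame, not a description of every model function's internals.

```r
d <- data.frame(
  y = c(1.2, 2.3, 3.1, 4.8),
  x1 = c(0.5, 1.0, 1.5, 2.0),
  group = factor(c("A", "B", "A", "B"))
)

# Job 1: select the columns named in the formula.
mf <- model.frame(y ~ x1 * group, data = d)
names(mf)            # "y" "x1" "group"

# Job 2: assign roles; the left-hand side is the outcome.
y_vec <- model.response(mf)

# Job 3: encode model-ready columns (intercept, dummy, interaction).
X <- model.matrix(y ~ x1 * group, data = mf)
colnames(X)          # "(Intercept)" "x1" "groupB" "x1:groupB"
```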
Traditional Base R Workflow
A common base R modeling workflow looks like this:
fit <- lm(y ~ x1 + x2 + group, data = dat)
Inspect a short model print:
fit
Check diagnostics:
plot(fit)
Inspect coefficients and inferential statistics:
summary(fit)
Compare nested models:
fit_small <- lm(y ~ x1 + x2, data = dat)
fit_big <- lm(y ~ x1 + x2 + group, data = dat)
anova(fit_small, fit_big)
Predict new samples:
new_dat <- data.frame(
  x1 = c(1, 2, 3),
  x2 = c(4, 5, 6),
  group = factor(c("A", "A", "B"))
)
predict(fit_big, newdata = new_dat)
Prediction Requires Matching Columns
When using predict(), the new data should have the same variables and compatible factor levels as the training data.
Important:
- pass the original factor column, not manually created dummy columns, unless the model explicitly requires a matrix
- check factor levels before prediction
- be careful when new data contain a category absent from training data
This is the same practical issue in biomedical models: a trained model only knows the feature space it saw during training.
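The sketch below shows both cases on simulated data: new samples whose factor levels match training, and a new sample with a level the model never saw. The data, the level "C", and the object names new_ok and new_bad are assumptions made for illustration; fit_big follows the naming used above.

```r
set.seed(42)

# Simulated training data with a two-level factor.
dat <- data.frame(
  x1 = rnorm(40),
  x2 = rnorm(40),
  group = factor(rep(c("A", "B"), times = 20))
)
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + (dat$group == "B") + rnorm(40, sd = 0.2)
fit_big <- lm(y ~ x1 + x2 + group, data = dat)

# Compatible new data: pass the factor column with the training levels.
new_ok <- data.frame(
  x1 = c(1, 2),
  x2 = c(4, 5),
  group = factor(c("A", "B"), levels = levels(dat$group))
)
pred_ok <- predict(fit_big, newdata = new_ok)

# A level absent from training: check before predicting.
new_bad <- data.frame(x1 = 1, x2 = 4, group = factor("C"))
unseen <- setdiff(levels(new_bad$group), fit_big$xlevels$group)
unseen   # "C"

# predict() stops with an error for the unseen level; capture its message.
res_bad <- tryCatch(
  predict(fit_big, newdata = new_bad),
  error = function(e) conditionMessage(e)
)
```

Checking levels against fit_big$xlevels before calling predict() turns a runtime error into an explicit data-quality decision.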
Missing Data
Missing data handling in base R can differ across model functions.
Common policies:
- na.fail(): error if missing values exist
- na.omit(): drop rows with missing values
- na.pass(): pass missing values through
- na.exclude(): omit for fitting but preserve residual/prediction alignment
The dangerous one is silent row dropping. If prediction output has fewer rows than input data, merging predictions back to samples can misalign results.
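The alignment difference between na.omit and na.exclude can be seen on a small made-up data frame. This is a sketch with lm(); other model functions may handle na.action differently.

```r
# Six rows, two of which have a missing value.
d <- data.frame(
  y = c(1.0, 2.1, NA, 4.2, 5.1, 5.9),
  x = c(1, 2, 3, NA, 5, 6)
)

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

# Both fits use the same four complete rows.
nobs(fit_omit)   # 4
nobs(fit_excl)   # 4

# na.omit: residuals are shorter than the input data.
n_omit <- length(residuals(fit_omit))   # 4

# na.exclude: residuals are padded with NA back to the input length.
n_excl <- length(residuals(fit_excl))   # 6
na_rows <- unname(which(is.na(residuals(fit_excl))))   # 3 4

# Safe to attach back to the original rows only in the padded case.
d$resid <- residuals(fit_excl)
```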
Reading check:
- Did the paper or code say how missing values were handled?
- Were rows dropped before modeling?
- Were dropped samples reported?
- Were imputation or normalization steps fit only on training data?
Why Tidymodels Helps
Base R modeling is powerful, but the interface is not fully consistent across packages.
For example, different models may use different predict() options for class probabilities:
predict(fit, type = "response")
predict(fit, type = "prob")
predict(fit, type = "posterior")
predict(fit, type = "probability")
tidymodels tries to make modeling more consistent by separating responsibilities:
- rsample: data splitting and resampling
- recipes: preprocessing and feature engineering
- parsnip: unified model specification
- workflows: combine preprocessing and model
- tune: hyperparameter tuning
- yardstick: performance metrics
- broom: tidy model outputs
The point is not that base R is wrong. The point is to make repeated modeling workflows easier to read, compare, and reproduce.
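One concrete instance of the predict() inconsistency, using only base R: for a logistic glm(), the default prediction is on the link (log-odds) scale, and probabilities require type = "response". The data here are simulated purely for illustration.

```r
set.seed(7)

# Simulated binary outcome.
d <- data.frame(x = rnorm(50))
d$y <- rbinom(50, size = 1, prob = plogis(0.5 + d$x))

fit <- glm(y ~ x, data = d, family = binomial())
new_d <- data.frame(x = c(-1, 0, 1))

# Default type is "link": log-odds, not probabilities.
p_link <- predict(fit, newdata = new_d)

# Probabilities must be requested explicitly for glm objects.
p_resp <- predict(fit, newdata = new_d, type = "response")

# The two scales are related by the inverse logit.
same <- isTRUE(all.equal(unname(plogis(p_link)), unname(p_resp)))
same   # TRUE
```

Other model classes use different type values for the same idea, which is the inconsistency a unified prediction interface is meant to hide.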
Reading Checks
When reading an R modeling script or method section, ask:
- What is on the left side of ~? That is y.
- What is on the right side? That is the predictor set X.
- Are categorical predictors factors? What is the reference level?
- Are there interaction terms such as x1:x2 or x1 * x2?
- Are transformations or polynomial terms created inside the formula?
- How were missing values handled?
- Is the model being used for inference, prediction, or both?
- If predictions are reported, were they evaluated on unseen data?
Note
For my own reading, the most useful rule is: formula syntax is not just notation. It defines roles and also performs model encoding. When a paper says it fit y ~ x1 + x2 + covariates, I should translate that into a concrete matrix: one outcome column, multiple predictor columns, dummy variables for categorical covariates, and possibly interaction or transformed columns.
This also keeps the X and y question clear. In a supervised model, y is the outcome being learned or modeled; X is the feature/covariate matrix after all encoding and preprocessing.
Sources
- Kuhn & Silge, Tidy Modeling with R, Chapter 3: A Review of R Modeling Fundamentals