Chapter 11 Multiple Linear Regression
Simple linear regression relates an outcome to one predictor. Multiple linear regression extends the same idea by including several predictors in one model. Researchers use this approach to estimate partial associations: each coefficient describes the association between a predictor and the outcome holding the other predictors constant.
In this chapter, we use ICAN-ICAR 2025 survey data to model a continuous outcome (for example, a numeracy score) as a function of grade, age, and background characteristics. Because ICAN-ICAR uses stratification, clustering, and survey weights, we fit models with the survey package so that standard errors reflect the sampling design.
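This chapter assumes a survey design object `des` has already been created. As a reminder of the shape such an object takes, here is a minimal sketch on toy data; the column names (`stratum`, `cluster_id`, `wt`) are hypothetical stand-ins for the actual ICAN-ICAR design variables, not the real ones:

```r
library(survey)

# Toy data standing in for the survey file; the design-variable names here
# (stratum, cluster_id, wt) are illustrative only
toy = data.frame(
  stratum        = rep(1:2, each = 4),
  cluster_id     = rep(1:4, each = 2),
  wt             = c(1.2, 1.2, 0.8, 0.8, 1.5, 1.5, 0.9, 0.9),
  numeracy_score = c(18, 22, 15, 20, 25, 19, 17, 21)
)

# A design object records strata, clusters, and weights so that svyglm()
# can compute design-based standard errors
des_toy = svydesign(ids = ~cluster_id, strata = ~stratum, weights = ~wt, data = toy)
class(des_toy)
```

The real `des` is built the same way from the ICAN-ICAR file, using its actual stratum, cluster, and weight variables.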
11.1 The multiple regression model and its interpretation
Multiple linear regression models a continuous outcome as a linear function of several predictors:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon. \]
Each slope coefficient is a partial effect. For example, in a model with grade and age, the coefficient for grade is the expected change in the outcome associated with one additional grade among learners of the same age, on average.
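One way to see what "holding the other predictors constant" means: in ordinary least squares, the coefficient on one predictor in a multiple regression equals the slope from regressing the outcome on the part of that predictor not explained by the others (the Frisch–Waugh–Lovell result). A self-contained sketch with made-up data and plain `lm()` (the same logic carries over to `svyglm`):

```r
set.seed(1)
n  = 200
x2 = rnorm(n)             # think of this as age
x1 = 0.5 * x2 + rnorm(n)  # think of this as grade, correlated with age
y  = 2 * x1 - 1 * x2 + rnorm(n)

# Coefficient on x1 from the multiple regression...
b_multi = coef(lm(y ~ x1 + x2))[["x1"]]

# ...equals the slope of y on the residual of x1 after removing x2
x1_resid  = resid(lm(x1 ~ x2))
b_partial = coef(lm(y ~ x1_resid))[["x1_resid"]]

all.equal(b_multi, b_partial)   # TRUE
```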
Multiple regression does not automatically solve causal problems. It improves adjustment only when you include relevant confounders and the model matches the data-generating process.
11.2 Data and analysis sample
To keep the example concrete, we estimate models within one country and within an age-restricted analysis sample. Choose any country that has a sufficient sample size.
country_name = "Kenya"
des_country = subset(
  des,
  CountryName == country_name &
    !is.na(numeracy_score) &
    !is.na(age) &
    !is.na(grade) &
    !is.na(gender) &
    !is.na(location)
)

If your data do not contain numeracy_score, switch the outcome to another continuous variable (for example, ican_time). The modelling steps stay the same.
11.3 Building a multiple regression model
We build the model in steps so that you can see how coefficients change when you add controls.
Model 1 (bivariate). Regress numeracy score on grade.
Model 2 (add a control). Add age. The grade coefficient now reflects the association between grade and numeracy score among learners of the same age.
Model 3 (add background variables). Add gender and location. These covariates often correlate with both grade and learning outcomes, so the model conditions on them.
m1 = svyglm(numeracy_score ~ grade, design = des_country)
m2 = svyglm(numeracy_score ~ grade + age, design = des_country)
m3 = svyglm(numeracy_score ~ grade + age + gender + location, design = des_country)
summary(m3)

##
## Call:
## svyglm(formula = numeracy_score ~ grade + age + gender + location,
##     design = des_country)
##
## Survey design:
## subset(des, CountryName == country_name & !is.na(numeracy_score) &
##     !is.na(age) & !is.na(grade) & !is.na(gender) & !is.na(location))
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    15.9460     0.9681  16.471  < 2e-16 ***
## grade           1.7403     0.1587  10.965  < 2e-16 ***
## age            -0.1207     0.1537  -0.785  0.43355    
## genderMale     -0.3662     0.3283  -1.116  0.26614    
## locationUrban   2.0460     0.6189   3.306  0.00115 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 80.05415)
##
## Number of Fisher Scoring iterations: 2
11.3.1 Comparing coefficients across models
A short table helps you compare how the coefficient on grade changes as you add controls.
grade_compare = bind_rows(
tidy(m1, conf.int = TRUE) |>
filter(term == "grade") |>
mutate(model = "M1: grade"),
tidy(m2, conf.int = TRUE) |>
filter(term == "grade") |>
mutate(model = "M2: grade + age"),
tidy(m3, conf.int = TRUE) |>
filter(term == "grade") |>
mutate(model = "M3: + gender + location")
) |>
select(model, estimate, conf.low, conf.high, p.value)
grade_compare

## # A tibble: 3 × 5
## model estimate conf.low conf.high p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 M1: grade 1.64 1.51 1.77 5.91e-59
## 2 M2: grade + age 1.79 1.46 2.12 8.28e-21
## 3 M3: + gender + location 1.74 1.43 2.05 1.59e-21
If the grade coefficient changes noticeably, the additional variables likely explain part of the association between grade and the outcome. Interpret this pattern as statistical adjustment, not as proof of causality.
11.3.2 Interpreting categorical predictors
R automatically creates indicator (dummy) variables for factors. In Model 3:
- the genderMale coefficient compares males with females (the reference group), holding other predictors constant;
- the locationUrban coefficient compares urban with rural, holding other predictors constant.
If you want a different reference category, set it explicitly:
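For instance, relevel() changes which factor level serves as the baseline. A self-contained base-R illustration (the level names mirror the chapter's gender variable):

```r
# By default, factor levels are ordered alphabetically, so "Female" is
# the reference category here
g = factor(c("Female", "Male", "Male", "Female"))
levels(g)        # "Female" "Male"

# relevel() moves the chosen level to the front, making it the reference
g_relev = relevel(g, ref = "Male")
levels(g_relev)  # "Male" "Female"
```

Applied to the survey design, this would look like `des_country = update(des_country, gender = relevel(factor(gender), ref = "Male"))`; refitting Model 3 then flips the sign and label of the gender coefficient, while the rest of the model is unchanged.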
11.4 Extensions: interactions and non-linear effects
Use extensions only when they answer a substantive question and you can justify them.
Interactions. An interaction allows the association between one predictor and the outcome to differ across groups. For example, the gender gap may differ between rural and urban areas:
m_int = svyglm(
numeracy_score ~ grade + age + gender * location,
design = des_country
)
m_int |>
  tidy(conf.int = TRUE)

Transformations. When the outcome is positive and right-skewed (such as the reading IRT score), a log transformation often improves interpretability and fit. In a log-outcome model, a coefficient multiplied by 100 approximates the percentage change in the outcome for a one-unit increase in the predictor:
des_log = update(des_country, ln_ReadingIRTScore = log(ReadingIRTScore))
m_log = svyglm(ln_ReadingIRTScore ~ age + gender + location, design = des_log)
# Approximate percent change for a one-unit increase in age:
100 * coef(m_log)[["age"]]
# Exact percentage change (better when the coefficient is large):
# 100 * (exp(coef(m_log)[["age"]]) - 1)

Quadratic terms. A squared term captures curvature (for example, diminishing returns to grade):
des_quad = update(des_country, grade_sq = grade^2)
m_quad = svyglm(numeracy_score ~ grade + grade_sq + age + gender + location, design = des_quad)
# Turning point: setting the derivative b + 2*a*grade to zero gives -b/(2a)
# (only meaningful if the quadratic term is substantively justified)
b = coef(m_quad)[["grade"]]
a = coef(m_quad)[["grade_sq"]]
turning_point = -b / (2 * a)
turning_point

11.5 Reporting results
Report results with coefficients, uncertainty, and clear reference categories. A compact regression table helps readers interpret the model.
report_tbl = tidy(m3, conf.int = TRUE) |>
mutate(
term = recode(
term,
`(Intercept)` = "Intercept",
grade = "Grade (one level)",
age = "Age (years)",
genderMale = "Male (ref: Female)",
locationUrban = "Urban (ref: Rural)"
)
) |>
select(term, estimate, conf.low, conf.high, p.value)
report_tbl

## # A tibble: 5 × 5
## term estimate conf.low conf.high p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Intercept 15.9 14.0 17.9 4.00e-37
## 2 Grade (one level) 1.74 1.43 2.05 1.59e-21
## 3 Age (years) -0.121 -0.424 0.183 4.34e- 1
## 4 Male (ref: Female) -0.366 -1.01 0.282 2.66e- 1
## 5 Urban (ref: Rural) 2.05 0.824 3.27 1.15e- 3
When you describe results, state the outcome scale and the comparison group. For example: “Holding age, gender, and location constant, one additional grade is associated with an average increase of 1.74 points in numeracy score.”
11.6 Practice exercises
- Change country_name and re-estimate Model 3. Which coefficients are stable across countries?
- Fit a model with numeracy_score ~ age + grade + gender + location + age:gender. Interpret the interaction term.
- Replace the outcome with ican_time and refit the model. Compare the interpretation before and after logging the outcome.