8 Analyzing PASEC 2019 data in R

This section provides a practical introduction to analysing PASEC 2019 data in R. It is intended for users with a basic working knowledge of R and includes step-by-step examples covering common PASEC analyses.

Use the contents below to navigate directly to the topics most relevant to your analysis.

8.1 Package installation

To get started, you need to install the required packages once. In the R console, type:

install.packages(c('Rrepest', 'haven', 'survey', 'mitools', 'quantreg', 'tidyverse')) 
  • The haven package is needed to read Stata files.
  • The Rrepest package automates the handling of plausible values and replicate weights, making it easier to analyse PASEC data correctly in R. It supports a range of common analyses, including means, summary statistics, frequency distributions, quantiles, correlations, group comparisons, and regression models. Throughout this guide we introduce the commands needed for common PASEC analyses.
  • The survey and mitools packages are used only for the alternatives analyses shown in ….
  • While the tidyverse package is for data wrangling.

You only need to run install.packages() once; there is no need to reinstall each time you open R.

# Load packages at the start of every session

library(haven)   
library(Rrepest)  
library(survey)
library(mitools)
library(tidyverse)  

8.2 Loading data in R

For hands-on examples we’ll start with the Grade 2 study.

# Paths below are relative to the project root
g2 = read_dta("data/PASEC2019_GRADE2_INT.dta")

It is good practice to run your analysis from an R script rather than the console, as this makes your work reproducible. All examples in this section are written as script code.

8.3 Applying English Labels

To encourage and assist non-French speaking users to analyse the rich PASEC data, we have supplied R scripts and do-files that apply English variable and value labels. These files are in the data/ folder of this project.

# Apply English labels
source("data/PASEC2019_Grade2_EN_labels.R")
g2 = apply_pasec_grade2_en_labels(g2)

After loading the file, let’s check if we have the most important variables that Rrepest uses. The five plausible values for language and mathematics are LECT_PV1 to LECT_PV5 and MATHS_PV1 to MATHS_PV5. The final weight is rwgt0. The replicate weights are rwgt1 to rwgt45 and countries are indicated by ID_PAYS.

# Confirm key variables exist 
key_vars <- c('LECT_PV1','LECT_PV5','MATHS_PV1','MATHS_PV5', 
              'ID_PAYS','rwgt0','rwgt1','rwgt45') 
stopifnot(all(key_vars %in% names(g2))) 

# Quick look at structure 
g2 |> 
  select(all_of(key_vars)) |> 
  glimpse() 
Rows: 21,933
Columns: 8
$ LECT_PV1  <dbl> 640.5484, 752.6683, 533.4055, 452.2114, 653.5640, 567.1990, 791.0265, 601.9592, 643.6853, 597.2301, 631.8771, 671.0275, 711.6534, 536.4511, 66…
$ LECT_PV5  <dbl> 681.0913, 750.5060, 520.3950, 454.3549, 670.3465, 584.3690, 703.6523, 583.4270, 630.7553, 640.1383, 646.9531, 649.6097, 711.6534, 540.5448, 62…
$ MATHS_PV1 <dbl> 627.7454, 686.3843, 492.4545, 456.3050, 699.5146, 594.5303, 605.2292, 625.7197, 583.2004, 598.8627, 672.5724, 661.6907, 688.4756, 504.6280, 63…
$ MATHS_PV5 <dbl> 649.4599, 656.8358, 475.3191, 481.9290, 663.3732, 612.6745, 615.9355, 649.5237, 555.9525, 618.9540, 651.0314, 645.4049, 688.4756, 470.7990, 60…
$ ID_PAYS   <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, …
$ rwgt0     <dbl> 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 15…
$ rwgt1     <dbl> 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 30…
$ rwgt45    <dbl> 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 15…

8.4 Rrepest command syntax

The basic syntax of the Rrepest() function is as follows:

Rrepest( 
  data       = <data frame or subset>, 
  svy        = <survey name>, 
  est        = est(<statistic>, target = <variable(s)>, regressor = <variable(s)>), 
  by         = <grouping variable(s)>,  # separate analysis per group 
  over       = <subgroup variable(s)>,  # within-group comparison 
  cm.weights = c(<final weight>, <replicate weights>), 
  var.factor = <variance scaling constant>, 
  n.pvs      = <number of plausible values> 
) 

Once these survey settings have been specified, you can use Rrepest to estimate means, summary statistics, proficiency levels, group differences, percentiles, and regression models while correctly accounting for both plausible values and replicate weights.

8.5 Before you begin: Set up Rrepest for PASEC

PASEC is not one of the studies supported by the Rrepest package meaning that there aren’t built-in survey specifications. Therefore, before running any analyses, you need to tell Rrepest how the PASEC data are structured.

You will need to use svy = 'SVY' and supply the survey parameters directly. To analyse PASEC data correctly, Rrepest needs to know the following survey specifications:

  • PASEC uses the paired Jackknife method for creating the replicate weights (JK2).

  • There are 90 replicate weights in the Grade 6 data and 45 replicate weights in the Grade 2 data.

  • The final weight is given by rwgt0

  • The replicate weights are rwgt1 to rwgt90 in the Grade 6 data and rwgt1 to rwgt45 in the Grade 2 data.

  • There are five sets of plausible values for each of mathematics, MATHS_PV1 to MATHS_PV5 and language LECT_PV1 to LECT_PV5.

We need to define weight vectors once and reuse them throughout your analyses:

# Weight helpers — define once, use everywhere 
cm2 = c('rwgt0', paste0('rwgt', 1:45))   # Grade 2: final + 45 replicate weights 
cm6 = c('rwgt0', paste0('rwgt', 1:90))   # Grade 6: final + 90 replicate weights 

Command for the grade 2 data will have the following syntax

Rrepest( 
  data       = subset(g2, <condition>), 
  svy        = 'SVY', 
  est        = est(<statistic>, target = <variable>), 
  cm.weights = cm2, 
  var.factor = 45, 
  n.pvs      = 5 
) 

For Grade 6, we need to specify that there are 90 replicate weights. Commands for grade 6 as follows

Rrepest( 
  data       = subset(g2, <condition>), 
  svy        = 'SVY', 
  est        = est(<statistic>, target = <variable>), 
  cm.weights = cm6, 
  var.factor = 90, 
  n.pvs      = 5 
) 

8.6 PASEC Analyses Examples

8.6.1 Calculating Mean Age by Country

Let’s start by calculating the average age of students in Grade 2 in Benin. The age variable is qe22. Let’s first inspect the ID_PAYS and qe22 variables.

# Country frequencies 
g2 |> 
  count(ID_PAYS)
# A tibble: 14 × 2
   ID_PAYS                               n
   <dbl+lbl>                         <int>
 1  1 [Benin]                         1654
 2  2 [Burkina Faso]                  1884
 3  3 [Burundi]                       1664
 4  4 [Cameroun]                      1780
 5  5 [Congo]                         1553
 6  6 [Cote D'Ivoire]                 1332
 7  7 [Gabon]                         1157
 8  8 [Guinee]                        1086
 9  9 [Madagascar]                    1883
10 10 [Niger]                         1730
11 11 [Democratic Republic of Congo]  1050
12 12 [Senegal]                       1341
13 13 [Chad]                          1727
14 14 [Togo]                          2092
# Age distribution 
g2|>
  count(qe22)
# A tibble: 19 × 2
   qe22             n
   <dbl+lbl>    <int>
 1  4               2
 2  5              83
 3  6            1295
 4  7            5467
 5  8            6923
 6  9            4261
 7 10            2099
 8 11             800
 9 12             458
10 13             189
11 14              73
12 15              41
13 16              19
14 17               8
15 18               1
16 19               1
17 20               1
18 22               1
19 99 [Missing]   211

NOTE: Benin is coded as \(1\). Missing age values are coded as \(99\). Because \(99\) is a placeholder used to indicate missing data rather than a learner’s actual age, these observations should be excluded from analyses involving age. Failure to do so may produce misleading results.

To calculate the average age of students in Grade 2 in Benin, the syntax is as follows:

# Calculating average age in Benin 

result_age = Rrepest( 
  data       = subset(g2, ID_PAYS == 1 & qe22 < 99), 
  svy        = 'SVY', 
  est        = est('mean', target = 'qe22'), 
  cm.weights = cm2, 
  var.factor = 45 
) 
print(result_age) 
# A tibble: 1 × 3
  .     b.mean.qe22 se.mean.qe22
  <chr>       <dbl>        <dbl>
1 Total        7.57       0.0599

Because age is an observed variable rather than a plausible-value variable, Rrepest uses only the sampling variance derived from the replicate weights. The reported standard error therefore reflects uncertainty arising from the sample design but not measurement uncertainty.

The estimated mean age of Grade 2 students in Benin is \(7.57\) years (\(95\%\) CI: \(7.45–7.68\)). This reflects the substantial grade repetition and late entry common in the region, with many students aged \(9–11\) also enrolled in Grade 2.

Alternative to generate the same result using survey package

PASEC provides replicate weights specifically so that users can reproduce the official variance estimation procedure. Stata’s svyset can be configured to use the paired jackknife replicate weights; the survey package can do the same.

The syntax to generate the same result is:

options(survey.lonely.psu = 'adjust') 
# Set up survey design with JK2 replicate weights 
# scale = 1/45 matches the mean-based JK2 variance formula used by Rrepest/repest 
des2 = svrepdesign(
  weights          = ~rwgt0,
  repweights       = as.matrix(g2[, paste0('rwgt', 1:45)]),   # pass the replicate-weight columns as a matrix
  type             = 'other',
  scale            = 1,
  rscales          = rep(1, 45),
  combined.weights = TRUE,
  mse              = TRUE,
  data             = g2
)
svymean(~qe22, subset(des2, ID_PAYS == 1 & qe22 < 99), na.rm = TRUE)
       mean     SE
qe22 7.5665 0.0599

8.6.2 Calculating mean mathematics proficiency

Unlike age, mathematics proficiency is represented by five plausible values. Rrepest estimates the mean separately for each plausible value, combines the five estimates using multiple-imputation formulas, and then incorporates the replicate-weight variance to produce the final standard error.

Now we will use the \(5\) plausible values to estimate mean mathematics scores in Benin.

Plausible values are a set of multiple imputations. Rrepest automatically recognises plausible values when the variable name contains the \(@\) symbol.

For example:

est('mean', target = 'MATHS_PV@') 

tells `Rrepest to analyse all five mathematics plausible values; combine results appropriately; and calculate standard errors that reflect both sampling and measurement uncertainty.

This allows researchers to obtain valid estimates without having to implement the multiple-imputation calculations manually.

# Calculating mean mathematics scores in Benin 

result_maths = Rrepest( 
  data       = subset(g2, ID_PAYS == 1), 
  svy        = 'SVY', 
  est        = est('mean', target = 'MATHS_PV@'), 
  cm.weights = cm2, 
  var.factor = 45, 
  n.pvs      = 5 
) 
print(result_maths) 
# A tibble: 1 × 3
  .     `b.mean.maths_pv@` `se.mean.maths_pv@`
  <chr>              <dbl>               <dbl>
1 Total               525.                7.16

The estimated mean mathematics proficiency in Benin is \(525\) points. The standard error of \(7.16\) reflects both sampling uncertainty and uncertainty arising from the plausible values. The confidence interval indicates that the population mean is likely to lie between \(511\) and \(539\) points.

PASEC proficiency scales are scaled using IRT procedures and should primarily be interpreted comparatively. The absolute value of the scale has no substantive meaning independent of the proficiency framework.

Alternative to generate the same result using survey and mitools package

An alternative approach using the survey and mitools packages reproduces the same result by looping over plausible values manually.

# withPV repeats svymean for each PV then MIcombine applies Rubin's rules 
pv_results = withPV(
  mapping = list(maths ~ MATHS_PV1 + MATHS_PV2 + MATHS_PV3 + MATHS_PV4 + MATHS_PV5),
  data    = subset(des2, ID_PAYS == 1),
  action  = function(dsgn) svymean(~maths, dsgn, na.rm = TRUE),
  rewrite = TRUE
)
summary(MIcombine(pv_results))
Multiple imputation results:
      withPV.svyrep.design(mapping = list(maths ~ MATHS_PV1 + MATHS_PV2 + 
    MATHS_PV3 + MATHS_PV4 + MATHS_PV5), data = subset(des2, ID_PAYS == 
    1), action = function(dsgn) svymean(~maths, dsgn, na.rm = TRUE), 
    rewrite = TRUE)
      MIcombine.default(pv_results)
           results       se   (lower   upper) missInfo
MATHS_PV1 525.0697 7.159119 511.0378 539.1016      1 %

Now let’s get means of mathematics and literacy for each country by including the by argument.

# Means of mathematics and literacy for each country 
result_by_country = Rrepest( 
  data       = g2, 
  svy        = 'SVY', 
  est        = est('mean', target = c('MATHS_PV@', 'LECT_PV@')), 
  by         = 'ID_PAYS', 
  cm.weights = cm2, 
  var.factor = 45, 
  n.pvs      = 5 
) 
print(result_by_country) 
# A tibble: 14 × 5
   ID_PAYS                      `b.mean.maths_pv@` `se.mean.maths_pv@` `b.mean.lect_pv@` `se.mean.lect_pv@`
   <chr>                                     <dbl>               <dbl>             <dbl>              <dbl>
 1 Benin                                      525.                7.16              525.               7.71
 2 Burkina Faso                               499.                8.23              493.               9.75
 3 Burundi                                    614.                2.37              625.               4.53
 4 Cameroun                                   517.                7.99              522.               8.39
 5 Chad                                       522.                6.84              508.               7.80
 6 Congo                                      592.                6.30              582.               7.48
 7 Cote D'Ivoire                              523.                4.06              517.               5.40
 8 Democratic Republic of Congo               568.                8.24              531.              10.5 
 9 Gabon                                      596.                9.40              610.              14.5 
10 Guinee                                     519.                9.40              469.              10.3 
11 Madagascar                                 550.                3.80              569.               6.90
12 Niger                                      545.                6.36              535.               7.19
13 Senegal                                    563.                6.08              557.               9.34
14 Togo                                       489.                5.29              475.               7.20

There is substantial variation in achievement across participating countries. Burundi records the highest Grade 2 mathematics score (\(614\) points), while Togo records the lowest (\(489\) points). Reading and mathematics performance are strongly correlated across countries, although some countries perform relatively better in one domain than the other. Formal statistical comparisons between countries should be conducted using regression models rather than visual inspection of the means alone.

Notice that standard errors vary considerably across countries — Burundi (country 3) has notably smaller standard errors than countries such as Gabon (7) or Guinea (8). This reflects differences in the homogeneity of performance within schools and the precision of the sampling design in each country.

8.6.3 Moving beyond means: summary statistics with Rrepest

Rrepesthas built-in support for a range of statistics beyond the mean, including percentiles and standard deviations. The est() function with multiple statistics generates point estimates and standard errors for a range of distribution summaries.

# First create an indicator for girl 
g2$girl = ifelse(g2$qe23 < 9, as.integer(g2$qe23 == 2), NA) 
# Beyond means - looking at the distribution of mathematics scores in Benin by gender 
# Quantile levels: 0.05, 0.25, 0.50, 0.75, 0.95 
result_dist = Rrepest( 
  data       = subset(g2, ID_PAYS == 1), 
  svy        = 'SVY', 
  est        = est(c('mean','std','quant',0.05,'quant',0.25,'quant',0.50,'quant',0.75,'quant',0.95), 
                   target = 'MATHS_PV@'), 
  by         = 'girl', 
  cm.weights = cm2, 
  var.factor = 45, 
  n.pvs      = 5 
) 

print(result_dist) 
# A tibble: 2 × 15
  girl  `b.mean.maths_pv@` `se.mean.maths_pv@` `b.std.maths_pv@` `se.std.maths_pv@` `b.quant00.maths_pv@` `se.quant00.maths_pv@` `b.quant025.maths_pv@`
  <chr>              <dbl>               <dbl>             <dbl>              <dbl>                 <dbl>                  <dbl>                  <dbl>
1 0                   536.                7.35              107.               7.38                  372.                   9.42                   460.
2 1                   513.                7.80              101.               6.81                  364.                  10.5                    441.
# ℹ 7 more variables: `se.quant025.maths_pv@` <dbl>, `b.quant05.maths_pv@` <dbl>, `se.quant05.maths_pv@` <dbl>, `b.quant075.maths_pv@` <dbl>,
#   `se.quant075.maths_pv@` <dbl>, `b.quant095.maths_pv@` <dbl>, `se.quant095.maths_pv@` <dbl>

Looking beyond average performance reveals important differences across the distribution. Boys in Benin have higher mathematics scores at the mean, median and upper tail of the distribution. For example, the median score among boys is approximately \(536\) points compared with \(508\) points among girls. The percentile estimates show that the gender gap is not confined to a small group of high-performing students but is evident across much of the achievement distribution.

Quantiles divide a distribution into equal-sized groups and are commonly used to examine how achievement is distributed across students. The code below produces the quintile cut-points (\(20th\), \(40th\), \(60th\), \(80th\) percentiles) for mathematics and literacy in Benin.

# Quintile boundaries for mathematics and language in Benin 
result_quint = Rrepest( 
  data       = subset(g2, ID_PAYS == 1), 
  svy        = 'SVY', 
  est        = est(c('quant',0.20,'quant',0.40,'quant',0.60,'quant',0.80), 
                   target = c('MATHS_PV@','LECT_PV@')), 
  cm.weights = cm2, 
  var.factor = 45, 
  n.pvs      = 5 
) 
print(result_quint) 
# A tibble: 1 × 17
  .     `b.quant02.maths_pv@` `se.quant02.maths_pv@` `b.quant04.maths_pv@` `se.quant04.maths_pv@` `b.quant06.maths_pv@` `se.quant06.maths_pv@`
  <chr>                 <dbl>                  <dbl>                 <dbl>                  <dbl>                 <dbl>                  <dbl>
1 Total                  435.                   6.68                  495.                   8.07                  548.                   7.46
# ℹ 10 more variables: `b.quant08.maths_pv@` <dbl>, `se.quant08.maths_pv@` <dbl>, `b.quant02.lect_pv@` <dbl>, `se.quant02.lect_pv@` <dbl>,
#   `b.quant04.lect_pv@` <dbl>, `se.quant04.lect_pv@` <dbl>, `b.quant06.lect_pv@` <dbl>, `se.quant06.lect_pv@` <dbl>, `b.quant08.lect_pv@` <dbl>,
#   `se.quant08.lect_pv@` <dbl>

The quintile boundaries divide the mathematics and reading distributions into five equal-sized groups. The reported values correspond to the score thresholds separating adjacent quintiles. For example, students scoring above approximately \(576\) points in mathematics fall in the highest-performing quintile in Benin.

8.6.4 Estimating PASEC proficiency levels

PASEC defines a set of proficiency levels that describe what students know and can do at different points along the mathematics scale. In this example, we estimate the proportion of Grade 2 students in Benin who fall into each of the four PASEC mathematics proficiency levels.

The four proficiency levels for Grade 2 mathematics are defined by PASEC as follows:

PASEC Proficiency Level Score
1 \(< 400.34\)
2 \(400.34 – 489.03\)
3 \(489.03–577.73\)
4 \(>577.73\)

These boundaries are set on the PASEC international scale and are the same for all countries, enabling cross-country comparisons. If we want to provide estimates for the PASEC defined levels of proficiency, we first need to create dummy variables for each level.

# Create proficiency-level indicator variables for each plausible value 
for (i in 1:5) { 
  v = paste0('MATHS_PV', i) 
  x = g2[[v]] 
  g2[[paste0('prop1_', v)]] = ifelse(!is.na(x), as.integer(x <= 400.34),              NA) 
  g2[[paste0('prop2_', v)]] = ifelse(!is.na(x), as.integer(x >  400.34 & x <= 489.03), NA) 
  g2[[paste0('prop3_', v)]] = ifelse(!is.na(x), as.integer(x >  489.03 & x <  577.73), NA) 
  g2[[paste0('prop4_', v)]] = ifelse(!is.na(x), as.integer(x >  577.73),              NA) 
} 
# Mean proportion in each proficiency level for Benin 
result_levels = Rrepest( 
  data       = subset(g2, ID_PAYS == 1), 
  svy        = 'SVY', 
  est        = est('mean', target = c('prop1_MATHS_PV@','prop2_MATHS_PV@', 
                                      'prop3_MATHS_PV@','prop4_MATHS_PV@')), 
  cm.weights = cm2, 
  var.factor = 45, 
  n.pvs      = 5 
) 
print(result_levels) 
# A tibble: 1 × 9
  .     `b.mean.prop1_maths_pv@` `se.mean.prop1_maths_pv@` `b.mean.prop2_maths_pv@` `se.mean.prop2_maths_pv@` `b.mean.prop3_maths_pv@` `se.mean.prop3_maths_pv@`
  <chr>                    <dbl>                     <dbl>                    <dbl>                     <dbl>                    <dbl>                     <dbl>
1 Total                    0.110                    0.0136                    0.271                    0.0186                    0.331                    0.0208
# ℹ 2 more variables: `b.mean.prop4_maths_pv@` <dbl>, `se.mean.prop4_maths_pv@` <dbl>

The results indicate that approximately \(11\%\) of Grade 2 students in Benin fall below Level 1 (the lowest proficiency threshold), while around \(29\%\) reach the highest level. The largest share of students falls in the middle proficiency levels. Because these estimates are derived from plausible values, the reported standard errors incorporate both measurement and sampling uncertainty.

8.6.5 Testing for differences between groups

The table below highlights the distinction between two common types of comparisons in PASEC data: within-country comparisons (e.g., boys versus girls in the same country) and between-country comparisons (e.g., Benin versus Burkina Faso). Because these comparisons involve different survey structures, they require slightly different analytical approaches.

The examples that follow demonstrate how to test for differences in achievement between groups using regression models in repest.

Comparison Type What it measures Survey design impact
Within-Country (e.g., Boys vs. Girls in Senegal) The gap between two demographic subgroups who share the same sampling strata, schools, and teachers. High covariance. Because groups are clustered together in the same schools, their errors are correlated.
Between-Country (e.g., Senegal vs Benin) The gap between two entirely independent populations with completely separate sampling frames Zero covariance. Sampling units in Country A have no mathematical relationship to sampling units in Country B

In Rrepest you should use over for within-country comparisons and by for between-country comparisons. You must NOT use over for countries. When using svy = 'SVY', the over() option is designed for within-country subgroup comparisons only. Using regression achieves the same goal and works correctly with the SVY option.

8.6.5.1 Testing for differences within countries - differences between boys and girls

Are there differences in mathematics scores between boys and girls? We can use linear regression to test for this.

# Test for differences between boys and girls in mathematics scores in Benin
result_gender = Rrepest(
  data       = subset(g2, ID_PAYS == 1),
  svy        = 'SVY',
  est        = est('lm', target = 'MATHS_PV@', regressor = 'girl'),
  cm.weights = cm2,
  var.factor = 45,
  n.pvs      = 5
)
print(result_gender)
# A tibble: 1 × 7
  .     `b.reg_maths_pv@.intercept` `se.reg_maths_pv@.intercept` `b.reg_maths_pv@.girl` `se.reg_maths_pv@.girl` `b.reg_maths_pv@.rsqr` `se.reg_maths_pv@.rsqr`
  <chr>                       <dbl>                        <dbl>                  <dbl>                   <dbl>                  <dbl>                   <dbl>
1 Total                        536.                         7.35                  -22.7                    4.89                 0.0117                 0.00512

The coefficient on girl provides the estimated difference in mathematics proficiency between girls and boys, together with the appropriate standard error and significance test. On average, girls’ scores are \(22.7\) points lower than boys in Benin. What about differences in the proportion reaching PASEC level 4?

# Test differences between boys and girls in percentage reaching PASEC level 4 in Benin
result_lev4 = Rrepest(
  data       = subset(g2, ID_PAYS == 1),
  svy        = 'SVY',
  est        = est('lm', target = 'prop4_MATHS_PV@', regressor = 'girl'),
  cm.weights = cm2,
  var.factor = 45,
  n.pvs      = 5
)
print(result_lev4)
# A tibble: 1 × 7
  .     `b.reg_prop4_maths_pv@.intercept` se.reg_prop4_maths_pv@.int…¹ b.reg_prop4_maths_pv…² se.reg_prop4_maths_p…³ b.reg_prop4_maths_pv…⁴ se.reg_prop4_maths_p…⁵
  <chr>                             <dbl>                        <dbl>                  <dbl>                  <dbl>                  <dbl>                  <dbl>
1 Total                             0.329                       0.0268                -0.0850                 0.0286                0.00898                0.00596
# ℹ abbreviated names: ¹​`se.reg_prop4_maths_pv@.intercept`, ²​`b.reg_prop4_maths_pv@.girl`, ³​`se.reg_prop4_maths_pv@.girl`, ⁴​`b.reg_prop4_maths_pv@.rsqr`,
#   ⁵​`se.reg_prop4_maths_pv@.rsqr`

Girls are \(8.5\) percentage points less likely to reach the highest proficiency level in mathematics than boys in Benin.

8.6.5.2 Testing for differences between countries

Are there differences in mean mathematics scores between Benin and Burkina Faso?

# Test differences in mathematics scores between Benin and Burkina Faso
# Convert ID_PAYS to factor so R creates an indicator for country 2
g2$ID_PAYS_f =factor(g2$ID_PAYS)
result_between = Rrepest(
  data       = subset(g2, ID_PAYS <= 2),
  svy        = 'SVY',
  est        = est('lm', target = 'MATHS_PV@', regressor = 'ID_PAYS_f'),
  cm.weights = cm2,
  var.factor = 45,
  n.pvs      = 5
)
print(result_between)
# A tibble: 1 × 7
  .     `b.reg_maths_pv@.intercept` `se.reg_maths_pv@.intercept` `b.reg_maths_pv@.id_pays_f` se.reg_maths_pv@.id_p…¹ `b.reg_maths_pv@.rsqr` se.reg_maths_pv@.rsq…²
  <chr>                       <dbl>                        <dbl>                       <dbl>                   <dbl>                  <dbl>                  <dbl>
1 Total                        551.                         16.8                       -26.4                    11.1                 0.0139                 0.0107
# ℹ abbreviated names: ¹​`se.reg_maths_pv@.id_pays_f`, ²​`se.reg_maths_pv@.rsqr`

Mathematics proficiency levels are \(26.4\) points lower in Burkina Faso compared to Benin, a statistically significant difference.

8.6.6 Multivariate regression

Now let’s use the Grade 6 data. There are \(90\) replicate weights so we will use cm6 instead of cm2. In this example, we’ll analyse the association between students’ maths scores (MATHS_PV@) and their gender (qe63) and socio-economic status (ses) in Côte d’Ivoire. The ses variable is a composite socioeconomic status index derived from the PASEC household questionnaire.

# Load Grade 6 data
g6 = read_dta("data/PASEC2019_Grade6_INT.dta")

# Apply English labels
source("data/PASEC2019_Grade6_EN_labels.R")
g6 = apply_pasec_grade6_en_labels(g6)

# Generate indicator variable for girl (Grade 6 uses qe63 for gender)
g6$girl = ifelse(g6$qe63 < 9, as.integer(g6$qe63 == 2), NA)

# Association between mathematics scores and socio-economic status and gender
result_multi = Rrepest(
  data       = subset(g6, ID_PAYS == 6),
  svy        = 'SVY',
  est        = est('lm', target = 'MATHS_PV@', regressor = c('ses', 'girl')),
  cm.weights = cm6,
  var.factor = 90,
  n.pvs      = 5
)
print(result_multi)
# A tibble: 1 × 9
  .     `b.reg_maths_pv@.intercept` `se.reg_maths_pv@.intercept` `b.reg_maths_pv@.ses` `se.reg_maths_pv@.ses` `b.reg_maths_pv@.girl` `se.reg_maths_pv@.girl`
  <chr>                       <dbl>                        <dbl>                 <dbl>                  <dbl>                  <dbl>                   <dbl>
1 Total                        349.                         18.7                  2.17                  0.388                  -11.6                    2.86
# ℹ 2 more variables: `b.reg_maths_pv@.rsqr` <dbl>, `se.reg_maths_pv@.rsqr` <dbl>

Rrepest reports R-squared automatically in the regression output (b.reg_…_rsqr column). The number of observations can be obtained separately with

nrow(subset(g6, ID_PAYS == 6 & !is.na(girl) & !is.na(ses)))

The regression estimates the association between mathematics proficiency, socioeconomic status (SES), and gender among Grade 6 students in Côte d’Ivoire while correctly accounting for both plausible values and the PASEC replicate-weight design.

The results indicate a strong positive association between socioeconomic status and mathematics achievement. A one-unit increase in the SES index is associated with an increase of approximately 2.17 points in mathematics proficiency (coefficient \(= 2.17\), p \(< 0.001\)). This relationship is statistically significant, suggesting that students from more advantaged socioeconomic backgrounds tend to perform better in mathematics.

The coefficient on the female indicator (girl) is \(-11.62\) (p \(< 0.001\)), indicating that girls score, on average, about \(12\) points lower than boys with similar socioeconomic status. Because SES is included in the model, this estimated gender gap reflects differences between boys and girls after accounting for socioeconomic background.

The constant term (\(349.0\)) represents the predicted mathematics score for a boy with an SES value of zero. While the intercept is necessary for estimation, it is typically of limited substantive interest because an SES value of zero may not correspond to a meaningful student profile.

Overall, the results suggest that both socioeconomic status and gender are important correlates of mathematics achievement in Côte d’Ivoire. Higher socioeconomic status is associated with better performance, while girls perform less well than boys on average, even after controlling for socioeconomic differences.

It is important to remember that these coefficients describe associations rather than causal effects. The regression does not imply that increasing a student’s socioeconomic status by one unit would necessarily increase their mathematics score by \(2.17\) points. Other factors associated with socioeconomic status, such as school quality, parental education, household resources, and learning opportunities, may also contribute to the observed relationship.