8 Analyzing PASEC 2019 data in R
This section provides a practical introduction to analysing PASEC 2019 data in R. It is intended for users with a basic working knowledge of R and includes step-by-step examples covering common PASEC analyses.
Use the contents below to navigate directly to the topics most relevant to your analysis.
8.1 Package installation
To get started, you need to install the required packages once. In the R console, type:
- The
havenpackage is needed to read Stata files. - The
Rrepestpackage automates the handling of plausible values and replicate weights, making it easier to analyse PASEC data correctly in R. It supports a range of common analyses, including means, summary statistics, frequency distributions, quantiles, correlations, group comparisons, and regression models. Throughout this guide we introduce the commands needed for common PASEC analyses. - The
surveyandmitoolspackages are used only for the alternatives analyses shown in …. - While the
tidyversepackage is for data wrangling.
You only need to run install.packages() once; there is no need to reinstall each time you open R.
8.2 Loading data in R
For hands-on examples we’ll start with the Grade 2 study.
It is good practice to run your analysis from an R script rather than the console, as this makes your work reproducible. All examples in this section are written as script code.
8.3 Applying English Labels
To encourage and assist non-French speaking users to analyse the rich PASEC data, we have supplied R scripts and do-files that apply English variable and value labels. These files are in the data/ folder of this project.
# Apply English labels
source("data/PASEC2019_Grade2_EN_labels.R")
g2 = apply_pasec_grade2_en_labels(g2)After loading the file, let’s check if we have the most important variables that Rrepest uses. The five plausible values for language and mathematics are LECT_PV1 to LECT_PV5 and MATHS_PV1 to MATHS_PV5. The final weight is rwgt0. The replicate weights are rwgt1 to rwgt45 and countries are indicated by ID_PAYS.
# Confirm key variables exist
key_vars <- c('LECT_PV1','LECT_PV5','MATHS_PV1','MATHS_PV5',
'ID_PAYS','rwgt0','rwgt1','rwgt45')
stopifnot(all(key_vars %in% names(g2)))
# Quick look at structure
g2 |>
select(all_of(key_vars)) |>
glimpse() Rows: 21,933
Columns: 8
$ LECT_PV1 <dbl> 640.5484, 752.6683, 533.4055, 452.2114, 653.5640, 567.1990, 791.0265, 601.9592, 643.6853, 597.2301, 631.8771, 671.0275, 711.6534, 536.4511, 66…
$ LECT_PV5 <dbl> 681.0913, 750.5060, 520.3950, 454.3549, 670.3465, 584.3690, 703.6523, 583.4270, 630.7553, 640.1383, 646.9531, 649.6097, 711.6534, 540.5448, 62…
$ MATHS_PV1 <dbl> 627.7454, 686.3843, 492.4545, 456.3050, 699.5146, 594.5303, 605.2292, 625.7197, 583.2004, 598.8627, 672.5724, 661.6907, 688.4756, 504.6280, 63…
$ MATHS_PV5 <dbl> 649.4599, 656.8358, 475.3191, 481.9290, 663.3732, 612.6745, 615.9355, 649.5237, 555.9525, 618.9540, 651.0314, 645.4049, 688.4756, 470.7990, 60…
$ ID_PAYS <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, …
$ rwgt0 <dbl> 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 15…
$ rwgt1 <dbl> 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 306.6116, 30…
$ rwgt45 <dbl> 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 153.3058, 15…
8.4 Rrepest command syntax
The basic syntax of the Rrepest() function is as follows:
Rrepest(
data = <data frame or subset>,
svy = <survey name>,
est = est(<statistic>, target = <variable(s)>, regressor = <variable(s)>),
by = <grouping variable(s)>, # separate analysis per group
over = <subgroup variable(s)>, # within-group comparison
cm.weights = c(<final weight>, <replicate weights>),
var.factor = <variance scaling constant>,
n.pvs = <number of plausible values>
) Once these survey settings have been specified, you can use Rrepest to estimate means, summary statistics, proficiency levels, group differences, percentiles, and regression models while correctly accounting for both plausible values and replicate weights.
8.5 Before you begin: Set up Rrepest for PASEC
PASEC is not one of the studies supported by the Rrepest package meaning that there aren’t built-in survey specifications. Therefore, before running any analyses, you need to tell Rrepest how the PASEC data are structured.
You will need to use svy = 'SVY' and supply the survey parameters directly. To analyse PASEC data correctly, Rrepest needs to know the following survey specifications:
PASEC uses the paired Jackknife method for creating the replicate weights (JK2).
There are 90 replicate weights in the Grade 6 data and 45 replicate weights in the Grade 2 data.
The final weight is given by rwgt0
The replicate weights are rwgt1 to rwgt90 in the Grade 6 data and rwgt1 to rwgt45 in the Grade 2 data.
There are five sets of plausible values for each of mathematics, MATHS_PV1 to MATHS_PV5 and language LECT_PV1 to LECT_PV5.
We need to define weight vectors once and reuse them throughout your analyses:
# Weight helpers — define once, use everywhere
cm2 = c('rwgt0', paste0('rwgt', 1:45)) # Grade 2: final + 45 replicate weights
cm6 = c('rwgt0', paste0('rwgt', 1:90)) # Grade 6: final + 90 replicate weights Command for the grade 2 data will have the following syntax
Rrepest(
data = subset(g2, <condition>),
svy = 'SVY',
est = est(<statistic>, target = <variable>),
cm.weights = cm2,
var.factor = 45,
n.pvs = 5
) For Grade 6, we need to specify that there are 90 replicate weights. Commands for grade 6 as follows
8.6 PASEC Analyses Examples
8.6.1 Calculating Mean Age by Country
Let’s start by calculating the average age of students in Grade 2 in Benin. The age variable is qe22. Let’s first inspect the ID_PAYS and qe22 variables.
# A tibble: 14 × 2
ID_PAYS n
<dbl+lbl> <int>
1 1 [Benin] 1654
2 2 [Burkina Faso] 1884
3 3 [Burundi] 1664
4 4 [Cameroun] 1780
5 5 [Congo] 1553
6 6 [Cote D'Ivoire] 1332
7 7 [Gabon] 1157
8 8 [Guinee] 1086
9 9 [Madagascar] 1883
10 10 [Niger] 1730
11 11 [Democratic Republic of Congo] 1050
12 12 [Senegal] 1341
13 13 [Chad] 1727
14 14 [Togo] 2092
# A tibble: 19 × 2
qe22 n
<dbl+lbl> <int>
1 4 2
2 5 83
3 6 1295
4 7 5467
5 8 6923
6 9 4261
7 10 2099
8 11 800
9 12 458
10 13 189
11 14 73
12 15 41
13 16 19
14 17 8
15 18 1
16 19 1
17 20 1
18 22 1
19 99 [Missing] 211
NOTE: Benin is coded as \(1\). Missing age values are coded as \(99\). Because \(99\) is a placeholder used to indicate missing data rather than a learner’s actual age, these observations should be excluded from analyses involving age. Failure to do so may produce misleading results.
To calculate the average age of students in Grade 2 in Benin, the syntax is as follows:
# Calculating average age in Benin
result_age = Rrepest(
data = subset(g2, ID_PAYS == 1 & qe22 < 99),
svy = 'SVY',
est = est('mean', target = 'qe22'),
cm.weights = cm2,
var.factor = 45
)
print(result_age) # A tibble: 1 × 3
. b.mean.qe22 se.mean.qe22
<chr> <dbl> <dbl>
1 Total 7.57 0.0599
Because age is an observed variable rather than a plausible-value variable, Rrepest uses only the sampling variance derived from the replicate weights. The reported standard error therefore reflects uncertainty arising from the sample design but not measurement uncertainty.
The estimated mean age of Grade 2 students in Benin is \(7.57\) years (\(95\%\) CI: \(7.45–7.68\)). This reflects the substantial grade repetition and late entry common in the region, with many students aged \(9–11\) also enrolled in Grade 2.
PASEC provides replicate weights specifically so that users can reproduce the official variance estimation procedure. Stata’s svyset can be configured to use the paired jackknife replicate weights; the survey package can do the same.
The syntax to generate the same result is:
options(survey.lonely.psu = 'adjust')
# Set up survey design with JK2 replicate weights
# scale = 1/45 matches the mean-based JK2 variance formula used by Rrepest/repest
des2 = svrepdesign(
weights = ~rwgt0,
repweights = as.matrix(g2[, paste0('rwgt', 1:45)]), # pass the replicate-weight columns as a matrix
type = 'other',
scale = 1,
rscales = rep(1, 45),
combined.weights = TRUE,
mse = TRUE,
data = g2
)
svymean(~qe22, subset(des2, ID_PAYS == 1 & qe22 < 99), na.rm = TRUE) mean SE
qe22 7.5665 0.0599
8.6.2 Calculating mean mathematics proficiency
Unlike age, mathematics proficiency is represented by five plausible values. Rrepest estimates the mean separately for each plausible value, combines the five estimates using multiple-imputation formulas, and then incorporates the replicate-weight variance to produce the final standard error.
Now we will use the \(5\) plausible values to estimate mean mathematics scores in Benin.
Plausible values are a set of multiple imputations. Rrepest automatically recognises plausible values when the variable name contains the \(@\) symbol.
For example:
tells `Rrepest to analyse all five mathematics plausible values; combine results appropriately; and calculate standard errors that reflect both sampling and measurement uncertainty.
This allows researchers to obtain valid estimates without having to implement the multiple-imputation calculations manually.
# Calculating mean mathematics scores in Benin
result_maths = Rrepest(
data = subset(g2, ID_PAYS == 1),
svy = 'SVY',
est = est('mean', target = 'MATHS_PV@'),
cm.weights = cm2,
var.factor = 45,
n.pvs = 5
)
print(result_maths) # A tibble: 1 × 3
. `b.mean.maths_pv@` `se.mean.maths_pv@`
<chr> <dbl> <dbl>
1 Total 525. 7.16
The estimated mean mathematics proficiency in Benin is \(525\) points. The standard error of \(7.16\) reflects both sampling uncertainty and uncertainty arising from the plausible values. The confidence interval indicates that the population mean is likely to lie between \(511\) and \(539\) points.
PASEC proficiency scales are scaled using IRT procedures and should primarily be interpreted comparatively. The absolute value of the scale has no substantive meaning independent of the proficiency framework.
An alternative approach using the survey and mitools packages reproduces the same result by looping over plausible values manually.
# withPV repeats svymean for each PV then MIcombine applies Rubin's rules
pv_results = withPV(
mapping = list(maths ~ MATHS_PV1 + MATHS_PV2 + MATHS_PV3 + MATHS_PV4 + MATHS_PV5),
data = subset(des2, ID_PAYS == 1),
action = function(dsgn) svymean(~maths, dsgn, na.rm = TRUE),
rewrite = TRUE
)
summary(MIcombine(pv_results))Multiple imputation results:
withPV.svyrep.design(mapping = list(maths ~ MATHS_PV1 + MATHS_PV2 +
MATHS_PV3 + MATHS_PV4 + MATHS_PV5), data = subset(des2, ID_PAYS ==
1), action = function(dsgn) svymean(~maths, dsgn, na.rm = TRUE),
rewrite = TRUE)
MIcombine.default(pv_results)
results se (lower upper) missInfo
MATHS_PV1 525.0697 7.159119 511.0378 539.1016 1 %
Now let’s get means of mathematics and literacy for each country by including the by argument.
# Means of mathematics and literacy for each country
result_by_country = Rrepest(
data = g2,
svy = 'SVY',
est = est('mean', target = c('MATHS_PV@', 'LECT_PV@')),
by = 'ID_PAYS',
cm.weights = cm2,
var.factor = 45,
n.pvs = 5
)
print(result_by_country) # A tibble: 14 × 5
ID_PAYS `b.mean.maths_pv@` `se.mean.maths_pv@` `b.mean.lect_pv@` `se.mean.lect_pv@`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Benin 525. 7.16 525. 7.71
2 Burkina Faso 499. 8.23 493. 9.75
3 Burundi 614. 2.37 625. 4.53
4 Cameroun 517. 7.99 522. 8.39
5 Chad 522. 6.84 508. 7.80
6 Congo 592. 6.30 582. 7.48
7 Cote D'Ivoire 523. 4.06 517. 5.40
8 Democratic Republic of Congo 568. 8.24 531. 10.5
9 Gabon 596. 9.40 610. 14.5
10 Guinee 519. 9.40 469. 10.3
11 Madagascar 550. 3.80 569. 6.90
12 Niger 545. 6.36 535. 7.19
13 Senegal 563. 6.08 557. 9.34
14 Togo 489. 5.29 475. 7.20
There is substantial variation in achievement across participating countries. Burundi records the highest Grade 2 mathematics score (\(614\) points), while Togo records the lowest (\(489\) points). Reading and mathematics performance are strongly correlated across countries, although some countries perform relatively better in one domain than the other. Formal statistical comparisons between countries should be conducted using regression models rather than visual inspection of the means alone.
Notice that standard errors vary considerably across countries — Burundi (country 3) has notably smaller standard errors than countries such as Gabon (7) or Guinea (8). This reflects differences in the homogeneity of performance within schools and the precision of the sampling design in each country.
8.6.3 Moving beyond means: summary statistics with Rrepest
Rrepesthas built-in support for a range of statistics beyond the mean, including percentiles and standard deviations. The est() function with multiple statistics generates point estimates and standard errors for a range of distribution summaries.
# First create an indicator for girl
g2$girl = ifelse(g2$qe23 < 9, as.integer(g2$qe23 == 2), NA)
# Beyond means - looking at the distribution of mathematics scores in Benin by gender
# Quantile levels: 0.05, 0.25, 0.50, 0.75, 0.95
result_dist = Rrepest(
data = subset(g2, ID_PAYS == 1),
svy = 'SVY',
est = est(c('mean','std','quant',0.05,'quant',0.25,'quant',0.50,'quant',0.75,'quant',0.95),
target = 'MATHS_PV@'),
by = 'girl',
cm.weights = cm2,
var.factor = 45,
n.pvs = 5
)
print(result_dist) # A tibble: 2 × 15
girl `b.mean.maths_pv@` `se.mean.maths_pv@` `b.std.maths_pv@` `se.std.maths_pv@` `b.quant00.maths_pv@` `se.quant00.maths_pv@` `b.quant025.maths_pv@`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 536. 7.35 107. 7.38 372. 9.42 460.
2 1 513. 7.80 101. 6.81 364. 10.5 441.
# ℹ 7 more variables: `se.quant025.maths_pv@` <dbl>, `b.quant05.maths_pv@` <dbl>, `se.quant05.maths_pv@` <dbl>, `b.quant075.maths_pv@` <dbl>,
# `se.quant075.maths_pv@` <dbl>, `b.quant095.maths_pv@` <dbl>, `se.quant095.maths_pv@` <dbl>
Looking beyond average performance reveals important differences across the distribution. Boys in Benin have higher mathematics scores at the mean, median and upper tail of the distribution. For example, the median score among boys is approximately \(536\) points compared with \(508\) points among girls. The percentile estimates show that the gender gap is not confined to a small group of high-performing students but is evident across much of the achievement distribution.
Quantiles divide a distribution into equal-sized groups and are commonly used to examine how achievement is distributed across students. The code below produces the quintile cut-points (\(20th\), \(40th\), \(60th\), \(80th\) percentiles) for mathematics and literacy in Benin.
# Quintile boundaries for mathematics and language in Benin
result_quint = Rrepest(
data = subset(g2, ID_PAYS == 1),
svy = 'SVY',
est = est(c('quant',0.20,'quant',0.40,'quant',0.60,'quant',0.80),
target = c('MATHS_PV@','LECT_PV@')),
cm.weights = cm2,
var.factor = 45,
n.pvs = 5
)
print(result_quint) # A tibble: 1 × 17
. `b.quant02.maths_pv@` `se.quant02.maths_pv@` `b.quant04.maths_pv@` `se.quant04.maths_pv@` `b.quant06.maths_pv@` `se.quant06.maths_pv@`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Total 435. 6.68 495. 8.07 548. 7.46
# ℹ 10 more variables: `b.quant08.maths_pv@` <dbl>, `se.quant08.maths_pv@` <dbl>, `b.quant02.lect_pv@` <dbl>, `se.quant02.lect_pv@` <dbl>,
# `b.quant04.lect_pv@` <dbl>, `se.quant04.lect_pv@` <dbl>, `b.quant06.lect_pv@` <dbl>, `se.quant06.lect_pv@` <dbl>, `b.quant08.lect_pv@` <dbl>,
# `se.quant08.lect_pv@` <dbl>
The quintile boundaries divide the mathematics and reading distributions into five equal-sized groups. The reported values correspond to the score thresholds separating adjacent quintiles. For example, students scoring above approximately \(576\) points in mathematics fall in the highest-performing quintile in Benin.
8.6.4 Estimating PASEC proficiency levels
PASEC defines a set of proficiency levels that describe what students know and can do at different points along the mathematics scale. In this example, we estimate the proportion of Grade 2 students in Benin who fall into each of the four PASEC mathematics proficiency levels.
The four proficiency levels for Grade 2 mathematics are defined by PASEC as follows:
| PASEC Proficiency Level | Score |
|---|---|
| 1 | \(< 400.34\) |
| 2 | \(400.34 – 489.03\) |
| 3 | \(489.03–577.73\) |
| 4 | \(>577.73\) |
These boundaries are set on the PASEC international scale and are the same for all countries, enabling cross-country comparisons. If we want to provide estimates for the PASEC defined levels of proficiency, we first need to create dummy variables for each level.
# Create proficiency-level indicator variables for each plausible value
for (i in 1:5) {
v = paste0('MATHS_PV', i)
x = g2[[v]]
g2[[paste0('prop1_', v)]] = ifelse(!is.na(x), as.integer(x <= 400.34), NA)
g2[[paste0('prop2_', v)]] = ifelse(!is.na(x), as.integer(x > 400.34 & x <= 489.03), NA)
g2[[paste0('prop3_', v)]] = ifelse(!is.na(x), as.integer(x > 489.03 & x < 577.73), NA)
g2[[paste0('prop4_', v)]] = ifelse(!is.na(x), as.integer(x > 577.73), NA)
}
# Mean proportion in each proficiency level for Benin
result_levels = Rrepest(
data = subset(g2, ID_PAYS == 1),
svy = 'SVY',
est = est('mean', target = c('prop1_MATHS_PV@','prop2_MATHS_PV@',
'prop3_MATHS_PV@','prop4_MATHS_PV@')),
cm.weights = cm2,
var.factor = 45,
n.pvs = 5
)
print(result_levels) # A tibble: 1 × 9
. `b.mean.prop1_maths_pv@` `se.mean.prop1_maths_pv@` `b.mean.prop2_maths_pv@` `se.mean.prop2_maths_pv@` `b.mean.prop3_maths_pv@` `se.mean.prop3_maths_pv@`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Total 0.110 0.0136 0.271 0.0186 0.331 0.0208
# ℹ 2 more variables: `b.mean.prop4_maths_pv@` <dbl>, `se.mean.prop4_maths_pv@` <dbl>
The results indicate that approximately \(11\%\) of Grade 2 students in Benin fall below Level 1 (the lowest proficiency threshold), while around \(29\%\) reach the highest level. The largest share of students falls in the middle proficiency levels. Because these estimates are derived from plausible values, the reported standard errors incorporate both measurement and sampling uncertainty.
8.6.5 Testing for differences between groups
The table below highlights the distinction between two common types of comparisons in PASEC data: within-country comparisons (e.g., boys versus girls in the same country) and between-country comparisons (e.g., Benin versus Burkina Faso). Because these comparisons involve different survey structures, they require slightly different analytical approaches.
The examples that follow demonstrate how to test for differences in achievement between groups using regression models in repest.
| Comparison Type | What it measures | Survey design impact |
|---|---|---|
| Within-Country (e.g., Boys vs. Girls in Senegal) | The gap between two demographic subgroups who share the same sampling strata, schools, and teachers. | High covariance. Because groups are clustered together in the same schools, their errors are correlated. |
| Between-Country (e.g., Senegal vs Benin) | The gap between two entirely independent populations with completely separate sampling frames | Zero covariance. Sampling units in Country A have no mathematical relationship to sampling units in Country B |
In Rrepest you should use over for within-country comparisons and by for between-country comparisons. You must NOT use over for countries. When using svy = 'SVY', the over() option is designed for within-country subgroup comparisons only. Using regression achieves the same goal and works correctly with the SVY option.
8.6.5.1 Testing for differences within countries - differences between boys and girls
Are there differences in mathematics scores between boys and girls? We can use linear regression to test for this.
# Test for differences between boys and girls in mathematics scores in Benin
result_gender = Rrepest(
data = subset(g2, ID_PAYS == 1),
svy = 'SVY',
est = est('lm', target = 'MATHS_PV@', regressor = 'girl'),
cm.weights = cm2,
var.factor = 45,
n.pvs = 5
)
print(result_gender)# A tibble: 1 × 7
. `b.reg_maths_pv@.intercept` `se.reg_maths_pv@.intercept` `b.reg_maths_pv@.girl` `se.reg_maths_pv@.girl` `b.reg_maths_pv@.rsqr` `se.reg_maths_pv@.rsqr`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Total 536. 7.35 -22.7 4.89 0.0117 0.00512
The coefficient on girl provides the estimated difference in mathematics proficiency between girls and boys, together with the appropriate standard error and significance test. On average, girls’ scores are \(22.7\) points lower than boys in Benin. What about differences in the proportion reaching PASEC level 4?
# Test differences between boys and girls in percentage reaching PASEC level 4 in Benin
result_lev4 = Rrepest(
data = subset(g2, ID_PAYS == 1),
svy = 'SVY',
est = est('lm', target = 'prop4_MATHS_PV@', regressor = 'girl'),
cm.weights = cm2,
var.factor = 45,
n.pvs = 5
)
print(result_lev4)# A tibble: 1 × 7
. `b.reg_prop4_maths_pv@.intercept` se.reg_prop4_maths_pv@.int…¹ b.reg_prop4_maths_pv…² se.reg_prop4_maths_p…³ b.reg_prop4_maths_pv…⁴ se.reg_prop4_maths_p…⁵
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Total 0.329 0.0268 -0.0850 0.0286 0.00898 0.00596
# ℹ abbreviated names: ¹`se.reg_prop4_maths_pv@.intercept`, ²`b.reg_prop4_maths_pv@.girl`, ³`se.reg_prop4_maths_pv@.girl`, ⁴`b.reg_prop4_maths_pv@.rsqr`,
# ⁵`se.reg_prop4_maths_pv@.rsqr`
Girls are \(8.5\) percentage points less likely to reach the highest proficiency level in mathematics than boys in Benin.
8.6.5.2 Testing for differences between countries
Are there differences in mean mathematics scores between Benin and Burkina Faso?
# Test differences in mathematics scores between Benin and Burkina Faso
# Convert ID_PAYS to factor so R creates an indicator for country 2
g2$ID_PAYS_f =factor(g2$ID_PAYS)
result_between = Rrepest(
data = subset(g2, ID_PAYS <= 2),
svy = 'SVY',
est = est('lm', target = 'MATHS_PV@', regressor = 'ID_PAYS_f'),
cm.weights = cm2,
var.factor = 45,
n.pvs = 5
)
print(result_between)# A tibble: 1 × 7
. `b.reg_maths_pv@.intercept` `se.reg_maths_pv@.intercept` `b.reg_maths_pv@.id_pays_f` se.reg_maths_pv@.id_p…¹ `b.reg_maths_pv@.rsqr` se.reg_maths_pv@.rsq…²
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Total 551. 16.8 -26.4 11.1 0.0139 0.0107
# ℹ abbreviated names: ¹`se.reg_maths_pv@.id_pays_f`, ²`se.reg_maths_pv@.rsqr`
Mathematics proficiency levels are \(26.4\) points lower in Burkina Faso compared to Benin, a statistically significant difference.
8.6.6 Multivariate regression
Now let’s use the Grade 6 data. There are \(90\) replicate weights so we will use cm6 instead of cm2. In this example, we’ll analyse the association between students’ maths scores (MATHS_PV@) and their gender (qe63) and socio-economic status (ses) in Côte d’Ivoire. The ses variable is a composite socioeconomic status index derived from the PASEC household questionnaire.
# Load Grade 6 data
g6 = read_dta("data/PASEC2019_Grade6_INT.dta")
# Apply English labels
source("data/PASEC2019_Grade6_EN_labels.R")
g6 = apply_pasec_grade6_en_labels(g6)
# Generate indicator variable for girl (Grade 6 uses qe63 for gender)
g6$girl = ifelse(g6$qe63 < 9, as.integer(g6$qe63 == 2), NA)
# Association between mathematics scores and socio-economic status and gender
result_multi = Rrepest(
data = subset(g6, ID_PAYS == 6),
svy = 'SVY',
est = est('lm', target = 'MATHS_PV@', regressor = c('ses', 'girl')),
cm.weights = cm6,
var.factor = 90,
n.pvs = 5
)
print(result_multi)# A tibble: 1 × 9
. `b.reg_maths_pv@.intercept` `se.reg_maths_pv@.intercept` `b.reg_maths_pv@.ses` `se.reg_maths_pv@.ses` `b.reg_maths_pv@.girl` `se.reg_maths_pv@.girl`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Total 349. 18.7 2.17 0.388 -11.6 2.86
# ℹ 2 more variables: `b.reg_maths_pv@.rsqr` <dbl>, `se.reg_maths_pv@.rsqr` <dbl>
Rrepest reports R-squared automatically in the regression output (b.reg_…_rsqr column). The number of observations can be obtained separately with
The regression estimates the association between mathematics proficiency, socioeconomic status (SES), and gender among Grade 6 students in Côte d’Ivoire while correctly accounting for both plausible values and the PASEC replicate-weight design.
The results indicate a strong positive association between socioeconomic status and mathematics achievement. A one-unit increase in the SES index is associated with an increase of approximately 2.17 points in mathematics proficiency (coefficient \(= 2.17\), p \(< 0.001\)). This relationship is statistically significant, suggesting that students from more advantaged socioeconomic backgrounds tend to perform better in mathematics.
The coefficient on the female indicator (girl) is \(-11.62\) (p \(< 0.001\)), indicating that girls score, on average, about \(12\) points lower than boys with similar socioeconomic status. Because SES is included in the model, this estimated gender gap reflects differences between boys and girls after accounting for socioeconomic background.
The constant term (\(349.0\)) represents the predicted mathematics score for a boy with an SES value of zero. While the intercept is necessary for estimation, it is typically of limited substantive interest because an SES value of zero may not correspond to a meaningful student profile.
Overall, the results suggest that both socioeconomic status and gender are important correlates of mathematics achievement in Côte d’Ivoire. Higher socioeconomic status is associated with better performance, while girls perform less well than boys on average, even after controlling for socioeconomic differences.
It is important to remember that these coefficients describe associations rather than causal effects. The regression does not imply that increasing a student’s socioeconomic status by one unit would necessarily increase their mathematics score by \(2.17\) points. Other factors associated with socioeconomic status, such as school quality, parental education, household resources, and learning opportunities, may also contribute to the observed relationship.