7 Analyzing PASEC 2019 data in Stata
This section provides a practical introduction to analysing PASEC 2019 data in Stata using the repest package. It is intended for users with a basic working knowledge of Stata and includes step-by-step examples covering common PASEC analyses.
7.1 Loading data in Stata
For hands-on examples we’ll start with the Grade 2 study.
* Set the path to your working folder if needed.
* Paths below are relative to the project root.
use "data/PASEC2019_GRADE2_INT.dta", clear
(Data file created by EpiData based on CIV_06_LIVRET_2A.rec)
It is good practice to run your analysis from a do-file rather than the command window, as this makes your work reproducible. All examples in this guide are written as do-file code available here.
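As an illustration of the do-file workflow, a minimal skeleton might look like the following. The do-file and log file names are hypothetical; only the data path is taken from this guide.

```stata
* pasec_grade2.do  (hypothetical file name)
clear all
capture log close                            // close any log left open by an earlier run
log using "pasec_grade2.log", replace text   // hypothetical log file name

* Load the Grade 2 data (path relative to the project root)
use "data/PASEC2019_GRADE2_INT.dta", clear

* ... analysis commands go here ...

log close
```

Keeping a text log alongside the do-file means the commands and their output are stored together, which makes it easier to check results later.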
7.2 Applying English labels
To encourage and assist non-French-speaking users to analyse the rich PASEC data, we have supplied do-files that apply English variable and value labels. These files are in the data/ folder of this project.
* Apply English labels
do "data/PASEC2019_Grade2_EN_labels.do"
* Show both numeric values and value labels
numlabel, add
qd26y: all characters numeric; replaced as int
(835 missing values generated)
qd27y: all characters numeric; replaced as int
(835 missing values generated)
qe21y: all characters numeric; replaced as int
qe21m: all characters numeric; replaced as byte
qe21d: all characters numeric; replaced as byte
The numlabel, add command displays the numeric code alongside the value label in tabulations (for example, 1. Benin rather than just Benin). This makes it easier to write if conditions using the correct numeric codes in subsequent commands.
After loading the file, let’s check if we have the most important variables that repest uses. The five plausible values for language and mathematics are LECT_PV1 to LECT_PV5 and MATHS_PV1 to MATHS_PV5. The final weight is rwgt0. The replicate weights are rwgt1 to rwgt45 and countries are indicated by ID_PAYS.
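The command that produced the listing below is not shown. A minimal sketch of one way to obtain such a listing, assuming the variable names given above, is:

```stata
* List the plausible values, country identifier and weights
describe LECT_PV* MATHS_PV* ID_PAYS rwgt*
```

If any of these variables were missing from the file, describe would stop with an error, which makes this a convenient first check before running repest.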
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
LECT_PV1 double %10.0g First plausible value in reading
LECT_PV2 double %10.0g Second plausible value in reading
LECT_PV3 double %10.0g Third plausible value in reading
LECT_PV4 double %10.0g Fourth plausible value in reading
LECT_PV5 double %10.0g Fifth plausible value in reading
MATHS_PV1 double %10.0g First plausible value in
mathematics
MATHS_PV2 double %10.0g Second plausible value in
mathematics
MATHS_PV3 double %10.0g Third plausible value in
mathematics
MATHS_PV4 double %10.0g Fourth plausible value in
mathematics
MATHS_PV5 double %10.0g Fifth plausible value in
mathematics
ID_PAYS float %32.0g country Country identifier
rwgt0 double %10.0g Student replicate weight 0
rwgt1 double %10.0g Student replicate weight 1
rwgt2 double %10.0g Student replicate weight 2
rwgt3 double %10.0g Student replicate weight 3
rwgt4 double %10.0g Student replicate weight 4
rwgt5 double %10.0g Student replicate weight 5
rwgt6 double %10.0g Student replicate weight 6
rwgt7 double %10.0g Student replicate weight 7
rwgt8 double %10.0g Student replicate weight 8
rwgt9 double %10.0g Student replicate weight 9
rwgt10 double %10.0g Student replicate weight 10
rwgt11 double %10.0g Student replicate weight 11
rwgt12 double %10.0g Student replicate weight 12
rwgt13 double %10.0g Student replicate weight 13
rwgt14 double %10.0g Student replicate weight 14
rwgt15 double %10.0g Student replicate weight 15
rwgt16 double %10.0g Student replicate weight 16
rwgt17 double %10.0g Student replicate weight 17
rwgt18 double %10.0g Student replicate weight 18
rwgt19 double %10.0g Student replicate weight 19
rwgt20 double %10.0g Student replicate weight 20
rwgt21 double %10.0g Student replicate weight 21
rwgt22 double %10.0g Student replicate weight 22
rwgt23 double %10.0g Student replicate weight 23
rwgt24 double %10.0g Student replicate weight 24
rwgt25 double %10.0g Student replicate weight 25
rwgt26 double %10.0g Student replicate weight 26
rwgt27 double %10.0g Student replicate weight 27
rwgt28 double %10.0g Student replicate weight 28
rwgt29 double %10.0g Student replicate weight 29
rwgt30 double %10.0g Student replicate weight 30
rwgt31 double %10.0g Student replicate weight 31
rwgt32 double %10.0g Student replicate weight 32
rwgt33 double %10.0g Student replicate weight 33
rwgt34 double %10.0g Student replicate weight 34
rwgt35 double %10.0g Student replicate weight 35
rwgt36 double %10.0g Student replicate weight 36
rwgt37 double %10.0g Student replicate weight 37
rwgt38 double %10.0g Student replicate weight 38
rwgt39 double %10.0g Student replicate weight 39
rwgt40 double %10.0g Student replicate weight 40
rwgt41 double %10.0g Student replicate weight 41
rwgt42 double %10.0g Student replicate weight 42
rwgt43 double %10.0g Student replicate weight 43
rwgt44 double %10.0g Student replicate weight 44
rwgt45 double %10.0g Student replicate weight 45
7.3 The repest package
To get started, you need to install the repest package once. This package automates the handling of plausible values and replicate weights. This makes it easier to analyse PASEC data correctly in Stata. In the Stata command window, type:
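The installation command itself does not appear above. Assuming repest is installed from the SSC archive, and given that the text refers to the replace option, it would be:

```stata
* Install (or update) repest from SSC
ssc install repest, replace
```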
You only need to run this command once; there is no need to re-install repest each time you open Stata. The replace option updates the package if a newer version is available.
7.4 repest command syntax
The basic syntax of the repest command is as follows:
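The syntax line is not reproduced above. The sketch below is reconstructed from the Grade 2 template shown in section 7.5, so treat it as an approximation and consult the package help file for the authoritative version.

```stata
* General form; square brackets mark optional elements and are not typed.
*   repest svyname [if] [in], estimate(cmd [, cmd_options]) [options]
*
* Once repest is installed, the full syntax and option list can be viewed with:
help repest
```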
svyname: Either one of the study names supported by the package (e.g., PISA, TIMSS, PIRLS) or SVY, which allows you to specify the survey design yourself.
estimate(cmd [, cmd_options]): Specifies the statistical command to run. cmd can be any Stata command that accepts weights (for example, mean, reg or qreg) or one of the built-in repest commands means, freq, summarize, corr and quantiletable. Command-specific options are passed after a comma within the parentheses.
7.5 Before you begin: Set up repest for PASEC
PASEC is not one of the studies with built-in survey specifications in repest, so throughout this guide we use SVY for svyname. All survey parameters must therefore be supplied explicitly using the svyparm() option.
PASEC uses the paired jackknife method to create the replicate weights. There are 90 replicate weights in the Grade 6 data and 45 in the Grade 2 data. The final weight is rwgt0; the replicate weights are rwgt1 to rwgt90 in the Grade 6 data and rwgt1 to rwgt45 in the Grade 2 data. There are five plausible values for each domain: MATHS_PV1 to MATHS_PV5 for mathematics and LECT_PV1 to LECT_PV5 for language.
The required parameters are as follows:
| Survey setting | svyparm() suboption | PASEC 2019 Grade 6 | PASEC 2019 Grade 2 |
|---|---|---|---|
| Final weight | final_weight_name() | rwgt0 | rwgt0 |
| Replicate weights | rep_weight_name() | rwgt | rwgt |
| Variance factor | variancefactor() | 1 | 1 |
| Number of replications | NREP() | 90 | 45 |
| Number of plausible values | NBpv() | 5 | 5 |
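As a sketch of what these settings imply, and assuming repest applies the usual replicate-weight formula (this is an assumption about the implementation, not a statement from the PASEC documentation), let \(\hat\theta_0\) be an estimate computed with the final weight rwgt0 and \(\hat\theta_r\) the same estimate computed with replicate weight \(r\). The sampling variance would then be:

\[
\widehat{V}_{\text{samp}}(\hat\theta_0) \;=\; c \sum_{r=1}^{R} \left(\hat\theta_r - \hat\theta_0\right)^2
\]

where \(c\) is the value given in variancefactor() (here \(1\)) and \(R\) is the value given in NREP() (\(45\) for Grade 2, \(90\) for Grade 6).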
Commands for analysing the Grade 2 data will have the following syntax:
repest SVY [if] [in] , estimate(cmd [,cmd_options]) [options] svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
For Grade 6, we need to specify that there are 90 replicate weights, that is, NREP(90) in place of NREP(45), with the rest of the command unchanged.
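The Grade 6 command is not shown above. By analogy with the Grade 2 template and the parameter table, it would differ only in NREP(). The sketch below also stores the settings in local macros to avoid retyping them; the macro names svy_g2 and svy_g6 are hypothetical, and local macros only persist while the do-file that defines them is running.

```stata
* Grade 6 template (same as Grade 2, but with NREP(90)):
*   repest SVY [if] [in], estimate(cmd [, cmd_options]) [options] svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(90) variancefactor(1))

* Optional: keep the survey settings in local macros (hypothetical names)
local svy_g2 "NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1)"
local svy_g6 "NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(90) variancefactor(1)"

* Example use with the Grade 2 file loaded
repest SVY if ID_PAYS==1, estimate(mean MATHS_PV@) svyparm(`svy_g2')
```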
7.6 PASEC Analysis Examples
7.6.1 Calculating Mean Age
Let’s start by calculating the average age of students in Grade 2 in Benin. The age variable is qe22. Let’s inspect the ID_PAYS and qe22 variables.
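The commands behind the two tables below are not shown. Assuming the labels have been applied as in section 7.2, simple one-way tabulations would produce tables of this kind:

```stata
* Inspect the country identifier and the age variable
tabulate ID_PAYS
tabulate qe22
```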
Country identifier | Freq. Percent Cum.
---------------------------------+-----------------------------------
1. Benin | 1,654 7.54 7.54
2. Burkina Faso | 1,884 8.59 16.13
3. Burundi | 1,664 7.59 23.72
4. Cameroun | 1,780 8.12 31.83
5. Congo | 1,553 7.08 38.91
6. Cote D'Ivoire | 1,332 6.07 44.99
7. Gabon | 1,157 5.28 50.26
8. Guinee | 1,086 4.95 55.21
9. Madagascar | 1,883 8.59 63.80
10. Niger | 1,730 7.89 71.69
11. Democratic Republic of Congo | 1,050 4.79 76.47
12. Senegal | 1,341 6.11 82.59
13. Chad | 1,727 7.87 90.46
14. Togo | 2,092 9.54 100.00
---------------------------------+-----------------------------------
Total | 21,933 100.00
Student age |
[4–16] | Freq. Percent Cum.
-------------------+-----------------------------------
4 | 2 0.01 0.01
5 | 83 0.38 0.39
6 | 1,295 5.90 6.29
7 | 5,467 24.93 31.22
8 | 6,923 31.56 62.78
9 | 4,261 19.43 82.21
10 | 2,099 9.57 91.78
11 | 800 3.65 95.43
12 | 458 2.09 97.52
13 | 189 0.86 98.38
14 | 73 0.33 98.71
15 | 41 0.19 98.90
16 | 19 0.09 98.98
17 | 8 0.04 99.02
18 | 1 0.00 99.02
19 | 1 0.00 99.03
20 | 1 0.00 99.03
22 | 1 0.00 99.04
99. Missing | 211 0.96 100.00
-------------------+-----------------------------------
Total | 21,933 100.00
NOTE: Benin is coded as \(1\). Missing age values are coded as \(99\). Because \(99\) is a placeholder used to indicate missing data rather than a learner’s actual age, these observations should be excluded from analyses involving age. Failure to do so will produce misleading results. To calculate the average age of students in Grade \(2\) in Benin, the syntax is as follows:
*Calculating average age in Benin
repest SVY if ID_PAYS==1&qe22<99, estimate(mean qe22) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
(file C:\Users\cash\AppData\Local\Temp\ST_912c_000005.tmp not found)
file C:\Users\cash\AppData\Local\Temp\ST_912c_000005.tmp saved
_pooled.
: _pooled
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
qe22 | 7.56647 .0598516 126.42 0.000 7.449163 7.683777
------------------------------------------------------------------------------
Because age is an observed variable rather than a plausible-value variable, repest uses only the sampling variance derived from the replicate weights. The reported standard error therefore reflects uncertainty arising from the sample design but not measurement uncertainty.
The estimated mean age of Grade 2 students in Benin is \(7.57\) years (\(95\%\) CI: \(7.45–7.68\)). This reflects the substantial grade repetition and late entry common in the region, with many students aged 9–11 also enrolled in Grade 2.
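As a quick check on the output above, the confidence interval can be reproduced by hand from the coefficient and its standard error, using the normal critical value (repest reports \(z\) statistics):

\[
7.56647 \;\pm\; 1.959964 \times 0.0598516 \;=\; 7.56647 \pm 0.117307
\quad\Longrightarrow\quad [\,7.449163,\; 7.683777\,]
\]

This matches the interval reported by repest.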
PASEC provides replicate weights specifically so that users can reproduce the official variance estimation procedure.
If the public data contain:
- the final sampling weight,
- the sampling strata, and
- the primary sampling units (PSUs),
and these accurately reflect the actual design, then an ordinary svyset declaration based on those variables will generally produce design-consistent standard errors.
However, this is not always possible because:
- the public design variables may be incomplete or masked; and
- the weighting process may include calibration steps not reflected in the released design variables.
Furthermore, an ordinary survey declaration with svyset will generally not reproduce the official PASEC standard errors.
Stata’s svyset can instead be configured to use the paired jackknife replicate weights, which produces the same standard errors as the official PASEC procedure.
The syntax to generate the same result is
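The original command is not shown here. The following is a minimal sketch of one way to declare the replicate weights with svyset, assuming that multiplier(1) plays the role of variancefactor(1) and that the mse option (deviations taken from the full-sample estimate) corresponds to what repest does. The local macro name repwts is hypothetical.

```stata
* Build the list rwgt1 ... rwgt45 explicitly, so that rwgt0
* (the final weight) is not included among the replicate weights.
local repwts
forvalues r = 1/45 {
    local repwts `repwts' rwgt`r'
}

* Declare the design: final weight plus jackknife replicate weights.
svyset [pweight = rwgt0], jkrweight(`repwts', multiplier(1)) vce(jackknife) mse

* Mean age in Benin, excluding the missing-value code 99.
svy: mean qe22 if ID_PAYS == 1 & qe22 < 99
```

The point estimate and standard error should agree with the repest output above if these assumptions hold. The confidence interval may differ slightly, because svy bases it on a t distribution rather than the normal distribution.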
7.6.2 Calculating Mean Mathematics Proficiency
Now we will use the \(5\) plausible values to estimate mean mathematics scores in Benin.
Plausible values are a set of multiple imputations. The repest package automatically recognises plausible values when the variable name contains the \(@\) symbol in place of the plausible-value number.
For example, MATHS_PV@ tells repest to:
- analyse all five mathematics plausible values (MATHS_PV1 to MATHS_PV5);
- combine the results appropriately; and
- calculate standard errors that reflect both sampling and measurement uncertainty.
This allows researchers to obtain valid estimates without having to implement the multiple-imputation calculations manually.
*Calculating mean mathematics scores in Benin
repest SVY if ID_PAYS==1, estimate(mean MATHS_PV@) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
(file C:\Users\cash\AppData\Local\Temp\ST_a0a0_000005.tmp not found)
file C:\Users\cash\AppData\Local\Temp\ST_a0a0_000005.tmp saved
_pooled.....
: _pooled
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 525.0697 7.159119 73.34 0.000 511.038 539.1013
------------------------------------------------------------------------------
The estimated mean mathematics proficiency in Benin is \(525\) points. The standard error of \(7.16\) reflects both sampling uncertainty and uncertainty arising from the plausible values. The confidence interval indicates that the population mean is likely to lie between \(511\) and \(539\) points.
PASEC proficiency scales are scaled using IRT procedures and should primarily be interpreted comparatively. The absolute value of the scale has no substantive meaning independent of the proficiency framework.
Unlike age, mathematics proficiency is represented by five plausible values. repest estimates the mean separately for each plausible value, combines the five estimates using multiple-imputation formulas, and then incorporates the replicate-weight variance to produce the final standard error.
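The combination step described above can be written out using the standard multiple-imputation rules. This is a sketch under the assumption that repest follows these conventional formulas. Let \(\hat\theta_m\) be the estimate from plausible value \(m\) and \(\widehat{V}_m\) its replicate-weight sampling variance, with \(M = 5\):

\[
\hat\theta = \frac{1}{M}\sum_{m=1}^{M}\hat\theta_m,
\qquad
\widehat{V}_{\text{samp}} = \frac{1}{M}\sum_{m=1}^{M}\widehat{V}_m,
\qquad
\widehat{V}_{\text{imp}} = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat\theta_m - \hat\theta\right)^2
\]

\[
\widehat{V}_{\text{total}} = \widehat{V}_{\text{samp}} + \left(1 + \frac{1}{M}\right)\widehat{V}_{\text{imp}},
\qquad
\text{SE}(\hat\theta) = \sqrt{\widehat{V}_{\text{total}}}
\]

For an observed variable such as age, every \(\hat\theta_m\) is identical, so \(\widehat{V}_{\text{imp}} = 0\) and only the sampling variance remains.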
An alternative Stata command for analysing plausible values is pv. The syntax to generate the same result is
Now let’s get means of mathematics and literacy for each country by including the by option.
*Means of mathematics and literacy for each country
repest SVY, estimate(mean MATHS_PV@ LECT_PV@) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1)) by(ID_PAYS)
1 2 3 4 5 6 7 8 9 10 11 12 13 14
(file C:\Users\cash\AppData\Local\Temp\ST_9148_000005.tmp not found)
file C:\Users\cash\AppData\Local\Temp\ST_9148_000005.tmp saved
1.....
ID_PAYS : 1
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 525.0697 7.159119 73.34 0.000 511.038 539.1013
LECT_PV_ | 524.8164 7.711942 68.05 0.000 509.7013 539.9315
------------------------------------------------------------------------------
2.....
ID_PAYS : 2
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 498.7166 8.233047 60.57 0.000 482.5801 514.8531
LECT_PV_ | 493.4861 9.747319 50.63 0.000 474.3817 512.5905
------------------------------------------------------------------------------
3.....
ID_PAYS : 3
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 614.3862 2.370557 259.17 0.000 609.74 619.0324
LECT_PV_ | 624.9706 4.534335 137.83 0.000 616.0835 633.8578
------------------------------------------------------------------------------
4.....
ID_PAYS : 4
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 516.7354 7.994531 64.64 0.000 501.0664 532.4044
LECT_PV_ | 522.1636 8.392591 62.22 0.000 505.7144 538.6127
------------------------------------------------------------------------------
5.....
ID_PAYS : 5
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 591.9041 6.298265 93.98 0.000 579.5598 604.2485
LECT_PV_ | 582.4071 7.480158 77.86 0.000 567.7462 597.0679
------------------------------------------------------------------------------
6.....
ID_PAYS : 6
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 522.5248 4.062804 128.61 0.000 514.5618 530.4877
LECT_PV_ | 516.5937 5.404587 95.58 0.000 506.0009 527.1865
------------------------------------------------------------------------------
7.....
ID_PAYS : 7
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 595.9217 9.399738 63.40 0.000 577.4986 614.3449
LECT_PV_ | 610.2501 14.4539 42.22 0.000 581.9209 638.5792
------------------------------------------------------------------------------
8.....
ID_PAYS : 8
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 519.3488 9.395843 55.27 0.000 500.9333 537.7644
LECT_PV_ | 469.039 10.25872 45.72 0.000 448.9322 489.1457
------------------------------------------------------------------------------
9.....
ID_PAYS : 9
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 549.7152 3.798524 144.72 0.000 542.2702 557.1602
LECT_PV_ | 568.8426 6.895106 82.50 0.000 555.3285 582.3568
------------------------------------------------------------------------------
10.....
ID_PAYS : 10
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 544.9191 6.358256 85.70 0.000 532.4571 557.381
LECT_PV_ | 534.6824 7.193762 74.33 0.000 520.5829 548.782
------------------------------------------------------------------------------
11.....
ID_PAYS : 11
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 567.7779 8.240901 68.90 0.000 551.626 583.9298
LECT_PV_ | 530.9825 10.5383 50.39 0.000 510.3278 551.6372
------------------------------------------------------------------------------
12.....
ID_PAYS : 12
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 563.4415 6.083019 92.63 0.000 551.519 575.364
LECT_PV_ | 557.1325 9.344315 59.62 0.000 538.818 575.447
------------------------------------------------------------------------------
13.....
ID_PAYS : 13
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 522.4416 6.840793 76.37 0.000 509.0338 535.8493
LECT_PV_ | 508.4999 7.8032 65.17 0.000 493.2059 523.7939
------------------------------------------------------------------------------
14.....
ID_PAYS : 14
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV_ | 489.3993 5.288786 92.54 0.000 479.0335 499.7651
LECT_PV_ | 474.8978 7.203616 65.92 0.000 460.779 489.0167
------------------------------------------------------------------------------
There is substantial variation in achievement across participating countries. Burundi records the highest Grade 2 mathematics score (\(614\) points), while Togo records the lowest (\(489\) points). Reading and mathematics performance are strongly correlated across countries, although some countries perform relatively better in one domain than the other. Formal statistical comparisons between countries should be conducted using regression models rather than visual inspection of the means alone.
Notice that standard errors vary considerably across countries - Burundi (country \(3\)) has notably smaller standard errors than countries such as Gabon (\(7\)) or Guinea (\(8\)). This reflects differences in the homogeneity of performance within schools and the precision of the sampling design in each country.
7.6.3 Moving beyond means: summary statistics with repest
repest has some built-in commands that are very useful for analysing large-scale assessments, including summary statistics that describe the full distribution.
The summarize command generates point estimates and standard errors for a range of statistics beyond the mean, such as standard deviations and percentiles.
*First create an indicator for girl
gen girl=qe23==2 if qe23<9
*Beyond means - looking at the distribution of mathematics scores in Benin
repest SVY if ID_PAYS==1, estimate(summarize MATHS_PV@, stats(mean sd p5 p25 p50 p75 p95)) by(girl) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
(28 missing values generated)
0 1
(file C:\Users\cash\AppData\Local\Temp\ST_8294_000005.tmp not found)
file C:\Users\cash\AppData\Local\Temp\ST_8294_000005.tmp saved
0.....
girl : 0
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV__~n | 535.7972 7.346251 72.93 0.000 521.3988 550.1956
MATHS_PV__sd | 106.7786 7.377892 14.47 0.000 92.31815 121.239
MATHS_PV__p5 | 371.9733 9.424126 39.47 0.000 353.5023 390.4442
MATHS_PV_~25 | 460.3875 7.622528 60.40 0.000 445.4476 475.3274
MATHS_PV_~50 | 536.3236 5.948664 90.16 0.000 524.6644 547.9828
MATHS_PV_~75 | 598.2395 9.216155 64.91 0.000 580.1762 616.3028
MATHS_PV_~95 | 713.3866 33.84142 21.08 0.000 647.0587 779.7146
------------------------------------------------------------------------------
1.....
girl : 1
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV__~n | 513.0814 7.796753 65.81 0.000 497.8001 528.3628
MATHS_PV__sd | 101.0514 6.809575 14.84 0.000 87.70485 114.3979
MATHS_PV__p5 | 364.1102 10.49694 34.69 0.000 343.5366 384.6839
MATHS_PV_~25 | 441.1321 7.258763 60.77 0.000 426.9052 455.359
MATHS_PV_~50 | 507.5715 8.92481 56.87 0.000 490.0792 525.0638
MATHS_PV_~75 | 575.9452 12.09901 47.60 0.000 552.2316 599.6588
MATHS_PV_~95 | 682.0938 23.3877 29.16 0.000 636.2547 727.9328
------------------------------------------------------------------------------
Looking beyond average performance reveals important differences across the distribution. Boys in Benin have higher mathematics scores at the mean, median and upper tail of the distribution. For example, the median score among boys is approximately \(536\) points compared with \(508\) points among girls. The percentile estimates show that the gender gap is not confined to a small group of high-performing students but is evident across much of the achievement distribution.
The quantiletable command creates quantile tables.
*Quintile table for mathematics and language in Benin
repest SVY if ID_PAYS==1, estimate(quantiletable MATHS_PV@ LECT_PV@, nquantiles(5)) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
(file C:\Users\cash\AppData\Local\Temp\ST_29e4_000005.tmp not found)
file C:\Users\cash\AppData\Local\Temp\ST_29e4_000005.tmp saved
_pooled.....
: _pooled
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
MATHS_PV__q1 | 387.316 5.641749 68.65 0.000 376.2584 398.3737
MATHS_PV__q2 | 465.0336 5.777279 80.49 0.000 453.7104 476.3569
MATHS_PV__q3 | 522.5471 6.296739 82.99 0.000 510.2057 534.8885
MATHS_PV__q4 | 575.5634 7.330665 78.51 0.000 561.1956 589.9313
MATHS_PV__q5 | 675.3612 18.45676 36.59 0.000 639.1866 711.5358
LECT_PV__q1 | 417.5467 7.716463 54.11 0.000 402.4227 432.6707
LECT_PV__q2 | 474.0085 6.38251 74.27 0.000 461.499 486.5179
LECT_PV__q3 | 510.6889 7.230566 70.63 0.000 496.5172 524.8605
LECT_PV__q4 | 561.0914 11.73183 47.83 0.000 538.0974 584.0853
LECT_PV__q5 | 661.1535 20.08941 32.91 0.000 621.779 700.528
------------------------------------------------------------------------------
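The quantile table divides students into five equal-sized groups (quintiles) along the first variable listed, here mathematics, and reports one value per quintile for each variable. Because there are five values rather than four cut-points, and because they sit inside the percentile ranges estimated earlier, they are best read as within-quintile means rather than thresholds. For example, students in the highest-performing mathematics quintile in Benin have a mean mathematics score of approximately \(675\) points and a mean reading score of approximately \(661\) points.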
Other built-in repest commands are means, freq and corr.
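As an illustration of two of these, the sketch below uses freq and corr with the same survey settings as the earlier examples. It assumes the girl indicator created in this section exists, and that freq takes a single categorical variable and corr a list of variables; check the repest help file for the exact sub-options.

```stata
* Percentage of boys and girls among Grade 2 students in Benin
repest SVY if ID_PAYS==1, estimate(freq girl) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))

* Correlation between mathematics and language scores in Benin
repest SVY if ID_PAYS==1, estimate(corr MATHS_PV@ LECT_PV@) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
```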
7.6.4 Estimating PASEC proficiency levels
PASEC defines a set of proficiency levels that describe what students know and can do at different points along the mathematics scale. In this example, we estimate the proportion of Grade 2 students in Benin who fall into each of the four PASEC mathematics proficiency levels.
The four proficiency levels for Grade 2 mathematics are defined by PASEC as follows:
| PASEC Proficiency Level | Score |
|---|---|
| 1 | \(< 400.34\) |
| 2 | \(400.34 – 489.03\) |
| 3 | \(489.03–577.73\) |
| 4 | \(>577.73\) |
These boundaries are set on the PASEC international scale and are the same for all countries, enabling cross-country comparisons. If we want to provide estimates for the PASEC defined levels of proficiency, we first need to create dummy variables for each level.
*Use PASEC defined levels
*Each score falls into exactly one level; a score equal to a cut-point goes to the lower level
foreach var of varlist MATHS_PV1-MATHS_PV5 {
    gen prop1_`var' = `var'<=400.34 if `var'<.
    gen prop2_`var' = `var'>400.34 & `var'<=489.03 if `var'<.
    gen prop3_`var' = `var'>489.03 & `var'<=577.73 if `var'<.
    gen prop4_`var' = `var'>577.73 if `var'<.
}
repest SVY if ID_PAYS==1, estimate(mean prop1_MATHS_PV@ prop2_MATHS_PV@ prop3_MATHS_PV@ prop4_MATHS_PV@ ) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
(file C:\Users\cash\AppData\Local\Temp\ST_a784_000005.tmp not found)
file C:\Users\cash\AppData\Local\Temp\ST_a784_000005.tmp saved
_pooled.....
: _pooled
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
prop1_MATH~_ | .109777 .0135943 8.08 0.000 .0831327 .1364213
prop2_MATH~_ | .2709238 .0186251 14.55 0.000 .2344192 .3074284
prop3_MATH~_ | .3307259 .0207808 15.92 0.000 .2899964 .3714554
prop4_MATH~_ | .2885734 .0256526 11.25 0.000 .2382951 .3388516
------------------------------------------------------------------------------
The results indicate that approximately \(11\%\) of Grade 2 students in Benin are in Level 1, the lowest proficiency level (scores below \(400.34\)), while around \(29\%\) reach Level 4, the highest. The remaining \(60\%\) are in the two middle levels (about \(27\%\) in Level 2 and \(33\%\) in Level 3). Because these estimates are derived from plausible values, the reported standard errors incorporate both measurement and sampling uncertainty.
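Since the four indicators are meant to partition the scale, a simple sanity check is that they sum to one for every student with a non-missing score. The sketch below assumes the prop1_ to prop4_ variables created above exist; the temporary variable name total is hypothetical.

```stata
* For each plausible value, the four level indicators should sum to exactly 1
foreach var of varlist MATHS_PV1-MATHS_PV5 {
    tempvar total
    egen `total' = rowtotal(prop1_`var' prop2_`var' prop3_`var' prop4_`var')
    assert `total' == 1 if `var' < .
}
```

assert stops with an error if the condition fails for any observation, for instance if a score lay exactly on a cut-point that none of the indicators captured. The four estimated proportions above sum to \(1\) up to rounding, which is consistent with the indicators covering the whole scale.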
7.6.5 Testing for differences between groups
The table below highlights the distinction between two common types of comparisons in PASEC data: within-country comparisons (e.g., boys versus girls in the same country) and between-country comparisons (e.g., Benin versus Burkina Faso). Because these comparisons involve different survey structures, they require slightly different analytical approaches.
The examples that follow demonstrate how to test for differences in achievement between groups using regression models in repest.
| Comparison Type | What it measures | Survey design impact |
|---|---|---|
| Within-Country (e.g., Boys vs. Girls in Senegal) | The gap between two demographic subgroups who share the same sampling strata, schools, and teachers. | High covariance. Because groups are clustered together in the same schools, their errors are correlated. |
| Between-Country (e.g., Senegal vs Benin) | The gap between two entirely independent populations with completely separate sampling frames | Zero covariance. Sampling units in Country A have no mathematical relationship to sampling units in Country B |
In repest, use over() for within-country comparisons and by() for between-country comparisons; do not use over() for countries. When SVY is used as the svyname, the over() option is designed for within-country subgroup comparisons only. Using regression achieves the same goal and works correctly with the SVY option.
7.6.5.1 Testing for differences within countries - differences between boys and girls
Are there differences in mathematics scores between boys and girls? We can use linear regression to test for this.
*Test for differences between boys and girls mathematics scores in Benin
repest SVY if ID_PAYS==1, estimate(reg MATHS_PV@ girl) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
(file C:\Users\cash\AppData\Local\Temp\ST_7208_000005.tmp not found)
file C:\Users\cash\AppData\Local\Temp\ST_7208_000005.tmp saved
_pooled.....
: _pooled
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
girl | -22.71577 4.891148 -4.64 0.000 -32.30224 -13.12929
_cons | 535.7972 7.346251 72.93 0.000 521.3988 550.1956
------------------------------------------------------------------------------
The coefficient on girl provides the estimated difference in mathematics proficiency between girls and boys, together with the appropriate standard error and significance test. On average, girls’ scores in Benin are \(22.7\) points lower than boys’ scores.
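The regression output can be cross-checked against the group means reported by summarize in section 7.6.3 (\(535.7972\) for boys, \(513.0814\) for girls). With a single binary regressor, the constant is the boys’ mean and the coefficient is the difference between the two group means:

\[
\hat\beta_{\text{girl}} = 513.0814 - 535.7972 = -22.7158,
\qquad
z = \frac{-22.71577}{4.891148} \approx -4.64
\]

Both values agree with the repest output above.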
What about differences in the proportion reaching PASEC level \(4\)?
*Test differences between boys and girls in percentage reaching PASEC level 4 in Benin
repest SVY if ID_PAYS==1, estimate(reg prop4_MATHS_PV@ girl) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
(file C:\Users\cash\AppData\Local\Temp\ST_48bc_000005.tmp not found)
file C:\Users\cash\AppData\Local\Temp\ST_48bc_000005.tmp saved
_pooled.....
: _pooled
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
girl | -.0850427 .0286047 -2.97 0.003 -.1411068 -.0289785
_cons | .3287349 .0268158 12.26 0.000 .2761768 .3812929
------------------------------------------------------------------------------
Girls are \(8.5\) percentage points less likely to reach the highest proficiency level in mathematics than boys in Benin.
7.6.5.2 Testing for differences between countries
Are there differences in mean mathematics scores between Benin and Burkina Faso?
*Test differences in mathematics scores between Benin and Burkina Faso
repest SVY if ID_PAYS<=2, estimate(reg MATHS_PV@ i.ID_PAYS) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
(file C:\Users\cash\AppData\Local\Temp\ST_6328_000005.tmp not found)
file C:\Users\cash\AppData\Local\Temp\ST_6328_000005.tmp saved
_pooled.....
: _pooled
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
_1b_ID_PAYS | 0 (omitted)
_2_ID_PAYS | -26.35308 11.12174 -2.37 0.018 -48.1513 -4.554867
_cons | 525.0697 7.159119 73.34 0.000 511.038 539.1013
------------------------------------------------------------------------------
Mean mathematics scores are \(26.4\) points lower in Burkina Faso than in Benin, a statistically significant difference (\(p = 0.018\)).
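The same comparison can also be approached without a regression. The sketch below assumes that the over() option of repest accepts the test suboption as described in the repest help file, in which case it should report each country's mean and the difference between the highest and lowest values of ID_PAYS in the sample. Treat it as an illustration to check against your installed version of repest rather than as the guide's own method.

```stata
* Mean mathematics score by country, with a test of the difference
* Assumes the Grade 2 file is loaded and that over(, test) is available
repest SVY if ID_PAYS<=2, estimate(summarize MATHS_PV@, stats(mean)) over(ID_PAYS, test) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
```

In the regression output above, the constant (\(525.1\)) is the mean for Benin, so the implied mean for Burkina Faso is \(525.1 - 26.4 = 498.7\).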
7.6.5.3 Testing for differences in percentiles
Standard linear regression (regress) models only the conditional mean, so it cannot be used to test differences in percentiles. For that, use quantile regression (qreg).
Quantile regression works like linear regression, but instead of fitting a line through the conditional mean of the data, it fits a line through a specified percentile (quantile) of the data:
qreg, quantile(0.50) models the median (50th percentile).
qreg, quantile(0.25) models the 25th percentile (lower quartile).
qreg, quantile(0.75) models the 75th percentile (upper quartile).
Like regress, qreg stores its coefficients (\(\beta\)) in e(b) and their variance-covariance matrix in e(V), so it can be passed to the estimate() option of repest in the same way.
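For reference, the textbook definition of the weighted quantile regression estimator at quantile \(\tau\) is given below. The notation is generic rather than PASEC-specific: \(y_i\) is the outcome, \(x_i\) the vector of regressors, and \(w_i\) the sampling weight.

\[
\hat{\beta}(\tau) = \arg\min_{\beta} \sum_i w_i \, \rho_\tau\!\left(y_i - x_i'\beta\right), \qquad \rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr)
\]

When the only regressor is a binary indicator such as girl, the constant is the \(\tau\)-th percentile for boys and the coefficient on girl is the difference between the girls' and boys' \(\tau\)-th percentiles.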
*Test for differences at the 25th and 75th percentile between boys and girls in Benin
repest SVY if ID_PAYS==1, estimate(qreg MATHS_PV@ girl, quantile(0.25)) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
(file C:\Users\cash\AppData\Local\Temp\ST_4d9c_000005.tmp not found)
file C:\Users\cash\AppData\Local\Temp\ST_4d9c_000005.tmp saved
_pooled.....
: _pooled
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
girl | -19.25537 8.848027 -2.18 0.030 -36.59719 -1.91356
_cons | 460.3875 7.622528 60.40 0.000 445.4476 475.3274
------------------------------------------------------------------------------
repest SVY if ID_PAYS==1, estimate(qreg MATHS_PV@ girl, quantile(0.75)) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
(file C:\Users\cash\AppData\Local\Temp\ST_743c_000005.tmp not found)
file C:\Users\cash\AppData\Local\Temp\ST_743c_000005.tmp saved
_pooled.....
: _pooled
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
girl | -22.29431 9.654296 -2.31 0.021 -41.21638 -3.372237
_cons | 598.2395 9.216155 64.91 0.000 580.1762 616.3028
------------------------------------------------------------------------------
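In Benin, the estimated gap between boys and girls is \(19.3\) points at the 25th percentile and \(22.3\) points at the 75th percentile. Both gaps are statistically significant, but with standard errors of roughly \(9\) points each, the two estimates are too imprecise to conclude that the gap widens across the distribution.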
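When several percentiles are of interest, the two commands above can be written as a loop. This is a minimal sketch that assumes the Grade 2 file is still loaded and that the girl indicator exists, as in the examples above; the choice of quantiles is illustrative.

```stata
* Gender gap in mathematics at several percentiles, Benin (ID_PAYS==1)
foreach q in 0.25 0.50 0.75 {
    display as text _newline "Quantile: `q'"
    repest SVY if ID_PAYS==1, estimate(qreg MATHS_PV@ girl, quantile(`q')) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(45) variancefactor(1))
}
```

Each pass prints one results table, so the coefficient on girl can be compared across percentiles in a single run.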
7.6.6 Multivariate Regression
Now let’s use the Grade \(6\) data. There are \(90\) replicate weights so we will need to replace NREP(45) with NREP(90) in the svyparm() suboptions.
Grade 6 setup
repest SVY [if] [in] , estimate(cmd [,cmd_options]) [options] svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(90) variancefactor(1))
In this example, we’ll analyze the association between students’ mathematics scores (MATHS_PV@) and their gender (qe63; the model below uses the indicator girl) and socio-economic status (ses) in Côte d’Ivoire. The ses variable is a composite socioeconomic status index derived from the PASEC household questionnaire.
(PASEC2019_LIVRET_D_FIN_PRIMAIRE)
* Apply English labels
do "data/PASEC2019_Grade6_EN_labels.do"
* Show both numeric values and value labels
numlabel, add
qd26y: all characters numeric; replaced as int
(1339 missing values generated)
qd27y: all characters numeric; replaced as int
(1339 missing values generated)
qe61y: all characters numeric; replaced as int
qe61m: all characters numeric; replaced as byte
qe61d: all characters numeric; replaced as byte
(16 missing values generated)
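Before running Grade 6 models, it may be worth confirming that the file in memory really carries 90 replicate weights, since a mismatch with NREP() would affect the standard errors. The sketch below assumes the replicate weights follow the rwgt1 to rwgt90 naming used in this guide and that no other variable names start with rwgt.

```stata
* List every variable whose name starts with rwgt (this includes rwgt0)
ds rwgt*
* Count them and subtract one for the final weight rwgt0
local nvars : word count `r(varlist)'
display "Replicate weights found: " `nvars' - 1
* Stop with an error if the last expected replicate weight is missing
confirm variable rwgt90
```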
*Association between mathematics scores and socio-economic status and gender
repest SVY if ID_PAYS==6, estimate(reg MATHS_PV@ ses girl) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(90) variancefactor(1)) results(add(r2 N))
(file C:\Users\cash\AppData\Local\Temp\ST_52dc_000005.tmp not found)
file C:\Users\cash\AppData\Local\Temp\ST_52dc_000005.tmp saved
_pooled.....
: _pooled
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
ses | 2.17009 .3883977 5.59 0.000 1.408845 2.931335
girl | -11.62055 2.855666 -4.07 0.000 -17.21755 -6.023549
_cons | 349.0048 18.6696 18.69 0.000 312.413 385.5965
e_r2 | .0786266 .0248018 3.17 0.002 .030016 .1272371
e_N | 3800 207.3596 18.33 0.000 3393.583 4206.417
------------------------------------------------------------------------------
The results(add(r2 N)) option adds extra statistics to the results table, drawn from the results that the estimation command stores in e(). Here, we add R-squared (r2), a measure of model fit, and the number of observations (N). repest labels these additional statistics e_r2 and e_N, and these are the names shown in the output table. The z statistics and p-values reported on those two rows are not of substantive interest.
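To see which statistics are available to add, one option is to run the estimation command once on its own and list what it stores. The sketch below uses the first plausible value and the final weight purely for inspection; its estimates and standard errors should not be reported, because they ignore the other plausible values and the replicate weights.

```stata
* Inspection only: one plausible value, final weight, no replicate weights
regress MATHS_PV1 ses girl if ID_PAYS==6 [pweight = rwgt0]
* List the scalars, macros and matrices that regress leaves in e()
ereturn list
```

Scalars listed as e(r2) and e(N) correspond to the r2 and N names passed to results(add()).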
The regression estimates the association between mathematics proficiency, socioeconomic status (SES), and gender among Grade \(6\) students in Côte d’Ivoire while correctly accounting for both plausible values and the PASEC replicate-weight design.
The results indicate a positive and statistically significant association between socioeconomic status and mathematics achievement. A one-unit increase in the SES index is associated with an increase of approximately \(2.17\) points in mathematics proficiency (coefficient \(= 2.17\), standard error \(= 0.39\), \(p < 0.001\)), suggesting that students from more advantaged socioeconomic backgrounds tend to perform better in mathematics.
The coefficient on the female indicator (girl) is \(-11.62\) (\(p < 0.001\)), indicating that girls score, on average, about 12 points lower than boys with similar socioeconomic status. Because SES is included in the model, this estimated gender gap reflects differences between boys and girls after accounting for socioeconomic background.
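Written out with the rounded coefficients from the table, the fitted model is:

\[
\widehat{\text{MATHS}} = 349.0 + 2.17 \times \text{ses} - 11.62 \times \text{girl}
\]

As an illustration, two students of the same gender whose SES index differs by \(10\) units (a value chosen only for the example) have predicted scores that differ by \(10 \times 2.17 = 21.7\) points. Put differently, the estimated gender gap is equivalent to a difference of about \(11.62 / 2.17 \approx 5.4\) units on the SES index.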
The constant term (\(349\)) represents the predicted mathematics score for a boy with an SES value of zero. While the intercept is necessary for estimation, it is typically of limited substantive interest because an SES value of zero may not correspond to a meaningful student profile.
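One common way to make the constant easier to interpret is to centre the SES index on its mean, so that the constant becomes the predicted score for a boy with average SES. The sketch below is an illustration rather than part of the original analysis: ses_c is a hypothetical variable name, and the mean is computed with the final weight among Grade 6 students in Côte d’Ivoire.

```stata
* Weighted mean of SES among Grade 6 students in Côte d'Ivoire (ID_PAYS==6)
summarize ses [aweight = rwgt0] if ID_PAYS==6, meanonly
* ses_c is a hypothetical name for the centred index
generate ses_c = ses - r(mean) if ID_PAYS==6
* Same model as above, with the centred index in place of ses
repest SVY if ID_PAYS==6, estimate(reg MATHS_PV@ ses_c girl) svyparm(NBpv(5) final_weight_name(rwgt0) rep_weight_name(rwgt) NREP(90) variancefactor(1))
```

Centring changes only the constant; the coefficients on the SES index and on girl are unaffected.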
Overall, the results suggest that both socioeconomic status and gender are important correlates of mathematics achievement in Côte d’Ivoire. Higher socioeconomic status is associated with better performance, while girls perform less well than boys on average, even after controlling for socioeconomic differences. Together, however, the two variables account for only about \(8\%\) of the variance in mathematics scores (e_r2 \(= 0.079\)).
It is important to remember that these coefficients describe associations rather than causal effects. The regression does not imply that increasing a student’s socioeconomic status by one unit would necessarily increase their mathematics score by \(2.17\) points. Other factors associated with socioeconomic status, such as school quality, parental education, household resources, and learning opportunities, may also contribute to the observed relationship.