Estimating Treatment Effects with the Explanatory Item Response Model

by: Gilbert

Tags

Apr

05

How Much do Scoring Methods Matter for Causal Inference?

The manner in which student outcomes are measured, scored, and analyzed often receives too little attention in randomized experiments. In this study, we aimed to explore the consequences of different scoring approaches for causal inference on test score data. We compared the performance of four methods, Classical Test Theory (CTT) sum scores, CTT mean scores, item response theory (IRT) scores, and the Explanatory Item Response Model (EIRM). In contrast to the CTT- and IRT-based approaches that score the test and estimate treatment effects in two separate steps, the EIRM is a latent variable model that allows for simultaneous estimation of student ability and the treatment effect. The EIRM has a long history in psychometric research, but applications to empirical causal inference settings are rare. Our results show that which model performs best depends on the context.

How to Read this Chart: Statistical power (y-axis) by missing item response rate (x-axis) and estimation method (color and shape) shows that the relative performance of each approach depends on the context. The EIRM and IRT-based scores are more robust to missing data and provide the most benefits to power when the latent trait is heteroskedastic. Legend: skew = latent trait is skewed, het = latent trait is heteroskedastic, mar = item responses are missing at random, sum = sum score, mean = mean score, 1PL = IRT theta score, EIRM = explanatory item response model.

Comparative Model Performance

The Impact of a Standards-based Grading Intervention on Ninth Graders’ Mathematics Learning

by: Kramer, Posner, Browman, Lawrence, Roem, & Krier

Category: Evaluation

Tags

Intervention RCT Causal Inference Measurement Assessment

Apr

05

What is Standards-based Grading?

Typically, U.S. classroom teachers use some tests and assignments purely for summative purposes, recording scores indelibly to be used in a weighted average that determines final grades. In contrast, under a standards-based grading system the teacher uses such assessment evidence both to evaluate the extent to which a student is proficient in each of the course’s learning outcomes at that particular moment in time (summative assessment), and then to provide students with personalized feedback designed to guide further learning (formative assessment). A key feature of standards-based grading is that students are then given opportunities to do further work, at home or in school, and to be reassessed for full credit. In other words, summative assessments become formative tools designed to promote further learning, not just markers of how much students have learned already.

How did we conduct this study?

We conducted a cluster randomized controlled trial, recruiting 29 schools and randomly assigning half (14 schools) to a Treatment condition, and half (15 schools) to a Control condition. Treatment schools implemented the standards-based grading program, called PARLO, in their ninth-grade algebra and geometry classrooms, and Control schools proceeded with business-as-usual. In our participating districts, instruction to learning standards and implementation of formative assessment were already commonly in use. Consequently, the PARLO program focused on implementing two necessary components of standards-based grading. The first was Mastery: students were rated as not-yet-proficient, proficient, or high-performance on each learning outcome, and final grades were computed using a formula based on the number of proficient and the number of high-performance learning outcomes. The second was Reassessment: after providing evidence that they had done further studying, any student could be reassessed for full credit on any learning outcome.

The Uncertain Role of Educational Software in Remediating Student Learning

by: Sidrah Baloch, Thomas J. Kane, Ethan Scherer, Douglas O. Staiger

Category: Evaluation

Tags

Intervention Causal Inference Math Technology

Sep

27

What is the potential of educational software for remediation?

Educators must balance the need to remediate students who are performing behind grade level with their obligation to teach grade-appropriate content to all. Educational software programs could help them strike this balance by incorporating below-grade-level content into an existing curriculum, allowing students to learn at their own pace while remaining in the same classroom as their peers. If effective, this practice could save school systems the high costs of more intensive remedial interventions like high-dosage tutoring, summer school, extra coursework, and grade retention.

How did this study examine the effectiveness of educational software for remediating below-grade-level students?

This study estimates the causal effects of providing low-performing students in grades 3-6 with below-grade-level math content via an online software program. Students who scored below a designated cutoff on a prior-year math assessment were assigned a modified version of the software program. The modified software included below-grade-level content before the grade-level material. Students who scored above the cutoff received only the grade-level curriculum. We examined whether receiving the modified curriculum affected students’ completion of grade-level learning objectives, pre- and post-objective quiz scores, and math test scores

Item Response Theory Models for Difference-in-Difference Estimates (and Whether They Are Worth the Trouble)

by: Soland

Category: Methodology

Tags

Methods Impact Variation Causal Inference Measurement Difference-in-Difference

May

01

When randomized control trials are not possible, quasi-experimental methods like Regression Discontinuity and Difference-in-Difference (DiD) often represent the best alternatives for high quality evaluation. Researchers using such methods frequently conduct exhaustive robustness checks to make sure the assumptions of the model are met, and that results aren’t sensitive to specific choices made in the analysis process. However, often there is less thought applied to how the outcomes for many quasi-experimental studies are created. For example, in studies that rely on survey data, scores may be created by adding up the item responses to produce total scores, or achievement tests may rely on scores produced by test vendors. In this study, several item response theory (IRT) models specific to the DiD design are presented to see if they improve on simpler scoring approaches in terms of the bias and statistical significance of impact estimates.

Why might using a simple scoring approach do harm in the quasi-experimental/DiD context?

While most researchers are aware that measurement error can impact the precision of treatment effect estimates, they may be less aware that measurement model misspecification can introduce bias into scores and, thereby, treatment effect estimates. Total/sum scores do not technically involve a measurement model, and therefore may seem almost free of assumptions. But in fact, they resemble a constrained measurement model that oftentimes makes unsupported assumptions, including that all items should be given the same weight when producing a score. For instance, on a depression survey, total scores would assume that items asking about trouble sleeping and self-harm should get the same weight in the score. Giving all items the same weight can bias scores. For example, if patterns of responses differ between treated and control groups, faulty total score assumptions could bias treatment effect estimates and mute variability in the outcome researchers wish to quantify.

What decisions involved in more sophisticated scoring approaches impact treatment estimates?

The Impact of a Virtual Coaching Program to Improve Instructional Alignment to State Standards

by: Smith, Garet, Song, Atchison, & Porter

by: Weiss

Category: uncategorized

Tags

Methods RCT Postsecondary Causal Inference

Oct

15

Michael Weiss

PDF Version

Beware! Teacher effects could mess up your individually randomized trial! Or such is the message of this paper focusing on what happens if you have individual randomization, but teachers are not randomly assigned to experimental groups.

The key idea is that if your experimental groups are systematically different in teacher quality, you will be estimating a combined impact of getting a good/bad teacher on top of the impact of your intervention.

S	M	T	W	T	F	S
29	30	1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31	1	2

Estimating Treatment Effects with the Explanatory Item Response Model

Read More

The Impact of a Standards-based Grading Intervention on Ninth Graders’ Mathematics Learning

Read More

The Uncertain Role of Educational Software in Remediating Student Learning

Read More

Item Response Theory Models for Difference-in-Difference Estimates (and Whether They Are Worth the Trouble)

Read More

The Impact of a Virtual Coaching Program to Improve Instructional Alignment to State Standards

What is the virtual coaching program tested in this study?

Read More

A recipe for disappointment: policy, effect size and the winner’s curse

Read More

Mitigating Illusory Results through Preregistration in Education

Read More

Partially Identified Treatment Effects for Generalizability

Read More

Bounding, an accessible method for estimating principal causal effects, examined and explained

Read More

The Implications of Teacher Selection and the Teacher Effect in Individually Randomized Group Treatment Trials

Read More

Quick Links

Calendar

Announcements

Our Sponsors

S	M	T	W	T	F	S
29	30	1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31	1	2

S	M	T	W	T	F	S
29	30	1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31	1	2

S	M	T	W	T	F	S
29	30	1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31	1	2