Filtered by tag: Methods

Does Early Mathematics Intervention Change the Processes Underlying Children’s Learning?

Summary by: Wen Wen

PDF Version

What are “state-” and “trait-” math achievements in early education?

Interventions can boost early math skills, but the role of these early skills in later math achievement is unclear. Consider that students who demonstrate stronger early math skills tend to demonstrate stronger later math achievement, yet some interventions that improve early math skills do not improve later math achievement – that is, the early benefits fade substantially after 2 or 3 years.

Read More

Design and Analytic Features for Reducing Biases in Skill-Building Intervention Impact Forecasts

Daniela Alvarez-Vargas, Sirui Wan, Lynn S. Fuchs, Alice Klein, & Drew H. Bailey

PDF Version

Despite their policy relevance, long-term evaluations of educational interventions are rare relative to the number of end-of-treatment evaluations. A common approach to this problem is to use statistical models to forecast the long-term effects of an intervention based on the estimated shorter-term effects. Such forecasts typically rely on the correlation between children’s early skills (e.g., preschool numeracy) and medium-term outcomes (e.g., 1st grade math achievement), calculated from longitudinal data available outside the evaluation. This approach sometimes over- or under-predicts the longer-term effects of early academic interventions, raising concerns about how best to forecast the long-term effects of such interventions. The present paper provides a methodological approach to assessing the types of research design and analysis specifications that may reduce biases in such forecasts.
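
The basic forecasting calculation is easy to sketch. The toy example below (with made-up numbers, not estimates from the paper) shows the standard approach: scale the end-of-treatment impact on the early skill by a benchmark coefficient taken from longitudinal data.

```python
# A minimal sketch of the standard forecasting logic, using
# made-up numbers (not estimates from the paper).

# End-of-treatment impact on preschool numeracy, in standard deviation units.
effect_on_early_skill = 0.30

# Standardized coefficient linking preschool numeracy to 1st-grade math,
# estimated from longitudinal data collected outside the evaluation.
benchmark_coefficient = 0.50

# Naive forecast of the medium-term impact on 1st-grade math.
forecasted_effect = effect_on_early_skill * benchmark_coefficient
print(f"Forecasted 1st-grade effect: {forecasted_effect:.2f} SD")  # 0.15 SD
```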

What did we do?

Read More

Quantifying ‘promising trials bias’ in randomized controlled trials in education

Sam Sims, Jake Anders, Matthew Inglis, Hugues Lortie-Forgues

PDF Version

Randomized controlled trials (RCTs) have proliferated in education, in part because they provide an unbiased estimator for the causal impact of interventions. Yet RCTs are only unbiased in expectation (on average across many RCTs).

Estimates of the effect size from specific RCTs will in general diverge from the true effect due to chance differences between the treatment and control groups. In suitably powered trials, this imbalance tends to be small, and statistical inference helps to control the rate of erroneous findings.
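
A small simulation makes the point concrete. The sketch below (illustrative code, not the authors’; the 0.2 “promising” threshold is an arbitrary assumption) draws many null RCTs and shows that selecting trials on promising results inflates the apparent effect.

```python
# A hedged simulation sketch: many RCTs with a true effect of zero,
# where sampling error alone creates a spread of estimated effect sizes.
import numpy as np

rng = np.random.default_rng(42)
n_trials, n_per_arm, true_effect = 10_000, 50, 0.0

# Simulate treatment and control outcomes for each trial.
treat = rng.normal(true_effect, 1.0, size=(n_trials, n_per_arm))
control = rng.normal(0.0, 1.0, size=(n_trials, n_per_arm))

# Standardized effect size estimate (difference in means, unit SD).
estimates = treat.mean(axis=1) - control.mean(axis=1)

# Selecting "promising" trials conditions on chance imbalance,
# so the selected estimates overstate the true (zero) effect.
promising = estimates[estimates > 0.2]
print(f"Mean estimate, all trials:       {estimates.mean():+.3f}")
print(f"Mean estimate, 'promising' only: {promising.mean():+.3f}")
```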

Read More

A Framework for Addressing Instrumentation Biases When Using Observation Systems as Outcome Measures in Instructional Interventions

Mark White, Bridget Maher, Brian Rowan

PDF Version

Many educational interventions seek to directly shift instructional practice. Observation systems are used to measure changes in instructional practice resulting from such interventions. However, the complexity of observation systems creates the risk of instrumentation biases. Instrumentation bias is bias resulting from changes to the ways that an instrument functions across conditions (e.g., from pre-test to post-test or between control and intervention conditions). For example, teachers could intentionally show off intervention-specific practices whenever they are observed, but not otherwise use those practices. Alternatively, an instructional intervention could shift instruction in ways that increase observation scores without impacting the underlying instructional dynamics that support student learning.

This conceptual paper with a case study exemplar provides a validity framework for using observation systems to evaluate the impact of interventions. Inferences about an intervention’s impact generally involve determining whether a teaching practice has changed within some setting. Observation scores, the evidence for these conclusions, are specific raters’ views of how a rubric would describe observed lessons. The conclusions are far more generalized than the observation scores. The framework systematically breaks down the processes necessary to operationalize an aspect of teaching practice and sample from a setting to obtain observation scores that can be generalized to draw conclusions.

Read More

A recipe for disappointment: policy, effect size and the winner’s curse

Adrian Simpson

PDF Version

Effect size and policy

Standardized effect size estimates are commonly used by the ‘evidence-based education’ community as a key metric for judging the relative importance, effectiveness, or practical significance of interventions across a set of studies: larger effect sizes indicate more effective interventions. However, this argument rarely applies: it holds only when every study uses linearly equatable outcomes, identical comparison treatments, and equally representative samples.
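
A one-line formula shows why. In the sketch below, the same raw difference in means is divided by a sample- and measure-specific standard deviation, so the resulting values of d are only comparable under the conditions just listed.

```latex
% Why standardized effect sizes are not directly comparable across
% studies: the raw impact is scaled by a sample- and measure-specific
% standard deviation.
\[
  d \;=\; \frac{\bar{Y}_{T} - \bar{Y}_{C}}{\sigma}
\]
% Holding the raw difference fixed, a more homogeneous sample or a
% narrower outcome measure shrinks sigma and inflates d, so a larger d
% need not mean a more effective intervention.
```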

Read More

The Meta-Analytic Rain Cloud (MARC) Plot: A New Approach to Visualizing Clearinghouse Data

Kaitlyn G. Fitzgerald & Elizabeth Tipton

PDF Version

What type of data do clearinghouses communicate?

As the body of scientific evidence about what works in education grows, so does the need to effectively communicate that evidence to policy-makers and practitioners. Clearinghouses, such as the What Works Clearinghouse (WWC), have emerged to facilitate the evidence-based decision-making process and have taken on the non-trivial task of distilling often complex research findings for non-researchers. Among other things, this involves reporting effect sizes, statistical uncertainty, and meta-analytic summaries. This information is often reported visually. However, existing visualizations often do not follow data visualization best practices or take the statistical cognition of the audience into consideration.

Read More

Modeling and Comparing Seasonal Trends in Interim Achievement Data

James Soland & Yeow Meng Thum

PDF Version

Introduction

Interim achievement tests are often used to monitor student and school performance over time. Unlike end-of-year achievement tests used for accountability, interim tests are administered multiple times per year (e.g., Fall, Winter, and Spring) and vary across schools in terms of when in the school year students take them. As a result, scores reflect seasonal patterns in achievement, including summer learning loss. Despite the prevalence of interim tests, few statistical models are designed to answer questions commonly asked with interim test data (e.g., Do students whose achievement grows the most over several years tend to experience below-average summer loss?). In this study, we compare the properties of three growth models that can be used to examine interim test data.
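
As a rough illustration (not necessarily the authors’ exact specifications), a piecewise growth model with separate school-year and summer slopes captures the seasonal structure such models must accommodate:

```latex
% A sketch of a piecewise growth model with separate school-year and
% summer slopes (illustrative notation, not the authors' exact models).
\[
  Y_{ti} \;=\; \pi_{0i} \;+\; \pi_{1i}\,\text{SchoolMonths}_{ti}
          \;+\; \pi_{2i}\,\text{SummerMonths}_{ti} \;+\; \varepsilon_{ti}
\]
% SchoolMonths and SummerMonths accumulate time in and out of school up
% to test occasion t; a negative pi_2 captures summer learning loss, and
% student-level variation in pi_1 and pi_2 lets the model relate
% multi-year growth to summer loss.
```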

Read More

Examining the Earnings Trajectories of Community College Students Using a Piecewise Growth Curve Modeling Approach

Summary by: Lily An

PDF Version

Traditional methods of estimating the returns to community college remain imprecise.

Historically, to estimate the labor market returns to a community college degree, researchers have compared the earnings of students who completed a degree to those who did not, at a single point in time, while controlling for background characteristics. With the expansion of longitudinal data sets, researchers have begun to consider how earnings before and during community college can affect returns to community college. However, even improved econometric analyses overlook some temporal influences on predicted earnings growth, such as the time between graduation and measured earnings, instead estimating averaged returns over time. These influences are particularly salient for community college students, who vary in their time-to-degree completion and often enter college with pre-existing or concurrent work experiences.
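
As a rough illustration of the piecewise idea (notation mine, not the authors’ exact model), earnings growth can be given separate slopes before enrollment, during college, and after completion:

```latex
% An illustrative piecewise specification for earnings trajectories:
% separate slopes before enrollment, during college, and after
% graduation (a sketch, not the paper's model).
\[
  \ln(w_{ti}) \;=\; \beta_{0i}
    \;+\; \beta_{1i}\,\text{PreYears}_{ti}
    \;+\; \beta_{2i}\,\text{InCollegeYears}_{ti}
    \;+\; \beta_{3i}\,\text{PostYears}_{ti}
    \;+\; \varepsilon_{ti}
\]
% Because the post-graduation slope beta_3 is estimated separately,
% returns can depend on time since completion rather than being
% averaged over the whole observation window.
```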

Read More

Parasympathetic Function: Relevance and Methodology for Early Education Research

Summary by: Lindsay Gomes

PDF Version

The definition of school readiness in the contexts of educational research, practice, and policy has changed considerably over the past 60 years. After a long period of prioritizing academic skills (e.g., letter-shape knowledge), many researchers now emphasize the extent to which young children can control their emotions and behaviors as key to school readiness. This capacity is commonly referred to as self-regulation, which is often defined in terms of volitional, cognitively-mediated processes such as executive functions. In this paper, we assert that understanding children’s parasympathetic function is essential to providing a holistic understanding of self-regulation in the classroom and for informing how the classroom environment can be tailored to most effectively promote young children’s development.

What is parasympathetic function and why is it important?

Read More

Gather-Narrow-Extract: A Framework for Studying Local Policy Variation Using Web-Scraping and Natural Language Processing

Kylie L. Anglin

PDF Version

Many education policy decisions are made at the local level. School districts make policies regarding hiring, resource allocation, and day-to-day operations. However, collecting data on local policy decisions has traditionally been expensive and time-consuming, sometimes leading researchers to leave important research questions unanswered.

This paper presents a framework for efficiently identifying and processing local policy documents posted online – documents like staff manuals, union contracts, and school improvement plans – using web-scraping and natural language processing.
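
A stripped-down version of the gather-narrow-extract idea looks something like the sketch below (the district URL and key phrase are hypothetical, and the paper’s pipeline is considerably richer):

```python
# A minimal gather-narrow-extract sketch (illustrative only).
import re
import requests
from bs4 import BeautifulSoup

URL = "https://www.example-district.org/policies"  # hypothetical site

# Gather: scrape the page and collect its text.
html = requests.get(URL, timeout=10).text
text = BeautifulSoup(html, "html.parser").get_text(" ")

# Narrow: keep only pages mentioning the policy of interest.
if re.search(r"teacher evaluation", text, re.IGNORECASE):
    # Extract: pull sentences containing the key phrase for coding.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = [s for s in sentences if re.search(r"teacher evaluation", s, re.IGNORECASE)]
    print("\n".join(hits[:5]))
```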

Read More

Mitigating Illusory Results through Preregistration in Education

Summary by: Claire Chuter

PDF Version

Good researchers thoroughly analyze their data, right? Practices like testing the right covariates, running your analyses in multiple ways to find the best-fitting model, screening for outliers, and testing for mediation or moderation effects are indeed important practices… but with a massive caveat. The aggregation of many of these rigorous research practices (as well as some more dubious ones) can lead to what the authors call “illusory results” – results that seem real but are unlikely to be reproduced. In other words, implementation of these common practices (see Figure 1 in the article) often leads researchers to run multiple analytic tests, which may unwittingly inflate the chances of stumbling upon a spurious significant finding.
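
The arithmetic behind this inflation is simple. The snippet below uses the textbook independence approximation (real analyses are typically correlated, which changes the exact numbers) to show how fast the familywise false-positive rate grows:

```python
# Probability of at least one spurious "significant" result among k
# independent tests at alpha = .05 (a simplified illustration).
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests -> P(at least one false positive) = {p_any_false_positive:.2f}")
# 1 test -> 0.05, 5 -> 0.23, 10 -> 0.40, 20 -> 0.64
```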

Potential Solutions

Read More

Partially Identified Treatment Effects for Generalizability

Wendy Chan

PDF Version

Will this intervention work for me?

This is one of the questions that make up the core of generalization research. Generalizations focus on the extent to which the findings of a study apply to people in a different context, in a different time period, or in a different study altogether. In education, one common type of generalization involves examining whether the results of an experiment (e.g., the estimated effect of an intervention) apply to a larger group of people, or a population.

Read More

The Methodological Challenges of Measuring Institutional Value-added in Higher Education

Tatiana Melguizo, Gema Zamarro, Tatiana Velasco, and Fabio J. Sanchez

PDF Version

Assessing the quality of higher education is hard, but there is growing pressure on governments to create a ranking system for institutions that can be used for assessment and funding allocations. Such a system, however, would require a reliable methodology to fairly assess colleges using a wide variety of indicators. In countries with centralized governance structures, these pressures have motivated researchers to develop “value-added” metrics of colleges’ contributions to student outcomes that can be used for summative assessment (Coates, 2009; Melguizo & Wainer, 2016; Shavelson et al., 2016). Estimating the “value-added” of colleges and programs, however, is methodologically challenging: first, high- and low-achieving students tend to self-select into different colleges – a behavior that, if not accounted for, may yield estimates that capture students’ prior achievement rather than colleges’ effectiveness at raising achievement; second, measures of gains in student learning outcomes (SLOs) at the higher education level are scant. In our paper, we study these challenges and compare the methods used for obtaining value-added metrics in the context of higher education in Colombia.
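
To see the selection problem in miniature, consider a deliberately simple value-added regression (synthetic data and a toy specification, not the paper’s estimators): adjusting exit scores for entry scores is what keeps the college term from merely reflecting who enrolls where.

```python
# A hedged sketch of the basic value-added idea: adjust college-exit
# scores for students' entry scores, so the college term captures gains
# rather than selection on prior achievement.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "exit_score": [60, 72, 55, 80, 65, 90],
    "entry_score": [50, 65, 48, 70, 60, 78],
    "college": ["A", "A", "B", "B", "C", "C"],
})

# College fixed effects net of entry achievement ~ crude value-added.
fit = smf.ols("exit_score ~ entry_score + C(college)", data=df).fit()
print(fit.params.filter(like="college"))
```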

How to best estimate value-added models in higher education?

Read More

Between-School Variation in Students’ Achievement, Motivation, Affect, and Learning Strategies: Results from 81 Countries for Planning Cluster-Randomized Trials in Education

Martin Brunner, Uli Keller, Marina Wenger, Antoine Fischbach & Oliver Lüdtke

PDF Version

Does an educational intervention work?

When planning an evaluation, researchers should ensure that it has enough statistical power to detect the expected intervention effect. The minimum detectable effect size, or MDES, is the smallest true effect size a study is well positioned to detect. If the MDES is too large, researchers may erroneously conclude that their intervention does not work even when it does. If the MDES is too small, that is not a problem per se, but it may mean increased cost to conduct the study. The sample size, along with several other quantities known as design parameters, goes into calculating the MDES. Researchers must estimate these design parameters. This paper provides an empirical basis for estimating design parameters in 81 countries across various outcomes.
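
For a concrete sense of how design parameters enter the calculation, here is the standard two-level approximation for a cluster-randomized trial (a textbook formula, not this paper’s own derivation):

```python
# A common two-level MDES approximation for cluster-randomized trials.
# M is the multiplier for alpha = .05 (two-tailed) and 80% power.
from math import sqrt

def mdes(J, n, rho, P=0.5, M=2.8):
    """J clusters, n students per cluster, intraclass correlation rho,
    proportion P of clusters assigned to treatment."""
    return M * sqrt(rho / (P * (1 - P) * J) + (1 - rho) / (P * (1 - P) * J * n))

# Design parameters such as rho must be estimated in advance; these are
# exactly the empirical inputs the paper supplies for 81 countries.
print(f"MDES = {mdes(J=40, n=25, rho=0.20):.2f} SD")
```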

Read More

Latent Profiles of Reading and Language and Their Association with Standardized Reading Outcomes in K-10th Grade

Barbara R Foorman, Yaacov Petscher, Christopher Stanley, & Adrea Truckenmiller

PDF Version

Differentiated instruction involves tailoring instruction to individual students’ learning needs. While critical to effective teaching, an understudied first step in differentiated instruction is understanding students’ learning profiles – that is, their strengths and weaknesses in knowledge and skills. It is only after a student’s learning profile is understood that a teacher can individualize instruction. But how can educators best measure learning profiles to facilitate differentiated instruction?

Descriptive approaches such as informal reading inventories lack the psychometric rigor required for purposes of classification, placement, and monitoring growth. However, quantitative approaches that classify and cluster (i.e., group) students by skill classes and validate the clusters by relating them to standardized tests offer a reliable tool for creating profiles. The objective of this study was twofold: first, to determine the profiles of reading and language skills that characterized 7,752 students in kindergarten through 10th grade; second, to relate the profiles to standardized reading outcomes.
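
As a rough sketch of the clustering step (synthetic data; the authors used latent profile analysis on real reading and language measures), a Gaussian mixture model recovers groups of students with similar skill patterns:

```python
# Illustrative profile estimation with a Gaussian mixture model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic scores on two skills (e.g., decoding and vocabulary).
X = np.vstack([
    rng.normal([-1, -1], 0.5, size=(100, 2)),  # low-low profile
    rng.normal([1, 1], 0.5, size=(100, 2)),    # high-high profile
    rng.normal([1, -1], 0.5, size=(100, 2)),   # mixed profile
])

gm = GaussianMixture(n_components=3, random_state=0).fit(X)
print("Profile means:\n", gm.means_.round(2))
print("Profile sizes:", np.bincount(gm.predict(X)))
```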

Read More

Bounding, an accessible method for estimating principal causal effects, examined and explained

Luke Miratrix, Jane Furey, Avi Feller, Todd Grindal, and Lindsay Page

PDF Version

Estimating program effects for subgroups is hard. Estimating effects for types of people who exist in theory, but whom we can’t always identify in practice (i.e., latent subgroups) is harder. These challenges arise often, with noncompliance being a primary example. Another is estimating effects on groups defined by “counterfactual experience,” i.e., by what opportunities would have been available absent treatment access. This paper tackles this difficult problem. We find that if one can predict, with some accuracy, latent subgroup membership, then bounding is a nice evaluation approach, relying on weak assumptions. This is in contrast to many alternatives that are tricky, often unstable, and/or rely on heroic assumptions.
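
A toy version of the bounding logic (all numbers hypothetical; the paper develops the approach much further): because the overall effect is a mixture of the two latent subgroups’ effects, plugging extreme values in for one subgroup brackets the other.

```python
# Bounding a latent subgroup's effect from the mixture identity
#   overall = pi_A * effect_A + (1 - pi_A) * effect_B
# (illustrative numbers, not estimates from the paper).
overall_effect = 0.20     # estimated overall ITT effect (hypothetical)
pi_A = 0.60               # estimated share in latent subgroup A (hypothetical)
b_lo, b_hi = -0.50, 0.50  # assumed plausible range for subgroup B's effect

# Solve the identity for effect_A at each extreme of effect_B.
lower = (overall_effect - (1 - pi_A) * b_hi) / pi_A
upper = (overall_effect - (1 - pi_A) * b_lo) / pi_A
print(f"Effect for subgroup A is bounded in [{lower:.2f}, {upper:.2f}]")
```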

What are latent subgroups again?

Read More

Using Multisite Experiments to Study Cross-Site Variation in Treatment Effects

Howard Bloom, Steve Raudenbush, Michael Weiss, & Kristin Porter

PDF Version

Multisite randomized trials are experiments where individuals are randomly assigned to alternative experimental arms within each of a collection of sites (e.g., schools).  They are used to estimate impacts of educational interventions. However, little attention has been paid to using them to quantify and report cross-site impact variation. The present paper, which received the 2017 JREE Outstanding Article Award, provides a methodology that can help to fill this gap.
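
The modeling idea can be sketched in two lines (notation illustrative): give each site its own impact, and summarize the spread of those impacts.

```latex
% A sketch of the random-coefficient setup behind cross-site impact
% variation: student i in site j, treatment indicator T.
\[
  Y_{ij} \;=\; \alpha_j \;+\; B_j\,T_{ij} \;+\; \varepsilon_{ij},
  \qquad B_j \sim N(\beta,\ \tau^2)
\]
% beta is the average impact across sites; tau, the standard deviation
% of the site-specific impacts B_j, quantifies cross-site variation.
```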

Why and how is knowledge about cross-site impact variation important?

Read More

The Implications of Teacher Selection and the Teacher Effect in Individually Randomized Group Treatment Trials

Michael Weiss

PDF Version

Beware! Teacher effects could mess up your individually randomized trial! Or such is the message of this paper focusing on what happens if you have individual randomization, but teachers are not randomly assigned to experimental groups.

The key idea is that if your experimental groups are systematically different in teacher quality, you will be estimating the impact of your intervention combined with the impact of being assigned a stronger or weaker teacher.
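
In one line (notation mine, not the paper’s), the contaminated contrast looks like this:

```latex
% If teachers are not randomized, the experimental contrast picks up
% the gap in average teacher quality between arms.
\[
  E[\hat{\Delta}] \;=\; \beta \;+\; \big(\bar{u}_{T} - \bar{u}_{C}\big)
\]
% beta is the intervention effect; u-bar_T and u-bar_C are the average
% teacher effects in the treatment and control groups.
```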

Read More

Effect Sizes Larger in Developer-Commissioned Studies than in Independent Studies

Rebecca Wolf, Jennifer Morrison, Amanda Inns, Robert Slavin, and Kelsey Risman

PDF Version

Rigorous evidence of program effectiveness has become increasingly important with the 2015 passage of the Every Student Succeeds Act (ESSA). One question that has not yet been addressed is whether findings from program evaluations carried out or commissioned by developers are as trustworthy as those identified in studies by independent third parties. Using study data from the What Works Clearinghouse, we found evidence of a “developer effect,” where program evaluations carried out or commissioned by developers produced average effect sizes that were substantially larger than those identified in evaluations conducted by independent parties.

Why is it important to accurately determine the effect sizes of an educational program?

Read More