
Estimating Treatment Effects with the Explanatory Item Response Model

How Much do Scoring Methods Matter for Causal Inference?

The manner in which student outcomes are measured, scored, and analyzed often receives too little attention in randomized experiments. In this study, we explored the consequences of different scoring approaches for causal inference on test score data. We compared the performance of four methods: Classical Test Theory (CTT) sum scores, CTT mean scores, item response theory (IRT) scores, and the Explanatory Item Response Model (EIRM). In contrast to the CTT- and IRT-based approaches, which score the test and estimate treatment effects in two separate steps, the EIRM is a latent variable model that allows for simultaneous estimation of student ability and the treatment effect. The EIRM has a long history in psychometric research, but applications to empirical causal inference settings are rare. Our results show that which model performs best depends on the context.
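To make the difference between the two CTT scores concrete, here is a minimal sketch of how sum and mean scores can diverge when item responses are missing. It is an illustration only, not the study's code: the function names, the response vectors, and the choice to let a missing item contribute nothing to the sum score are all assumptions.

```python
import math

def sum_score(responses):
    """CTT sum score: number of correct answers.

    Assumption: a missing item (None) simply contributes nothing,
    which is equivalent to scoring it as incorrect.
    """
    return sum(r for r in responses if r is not None)

def mean_score(responses):
    """CTT mean score: proportion correct among the items actually answered."""
    observed = [r for r in responses if r is not None]
    return sum(observed) / len(observed) if observed else math.nan

# Hypothetical student answering 8 dichotomous items (1 = correct, 0 = incorrect).
complete = [1, 1, 0, 1, 1, 0, 1, 1]
# The same student with every second response missing.
with_missing = [1, None, 0, None, 1, None, 1, None]

print(sum_score(complete), mean_score(complete))
print(sum_score(with_missing), mean_score(with_missing))
```

In this made-up case the sum score is cut in half by the missing responses while the mean score is unchanged, which is one way the two scores can behave differently once item responses go missing.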

How to Read this Chart: Statistical power (y-axis) by missing item response rate (x-axis) and estimation method (color and shape) shows that the relative performance of each approach depends on the context. The EIRM and IRT-based scores are more robust to missing data and provide the most benefits to power when the latent trait is heteroskedastic. Legend: skew = latent trait is skewed, het = latent trait is heteroskedastic, mar = item responses are missing at random, sum = sum score, mean = mean score, 1PL = IRT theta score, EIRM = explanatory item response model.

Comparative Model Performance

Read More

The Impact of a Standards-based Grading Intervention on Ninth Graders’ Mathematics Learning

What is Standards-based Grading?

Typically, U.S. classroom teachers use some tests and assignments purely for summative purposes, recording scores indelibly to be used in a weighted average that determines final grades. In contrast, under a standards-based grading system the teacher uses such assessment evidence both to evaluate the extent to which a student is proficient in each of the course’s learning outcomes at that particular moment in time (summative assessment), and then to provide students with personalized feedback designed to guide further learning (formative assessment). A key feature of standards-based grading is that students are then given opportunities to do further work, at home or in school, and to be reassessed for full credit. In other words, summative assessments become formative tools designed to promote further learning, not just markers of how much students have learned already.

How did we conduct this study?

We conducted a cluster randomized controlled trial, recruiting 29 schools and randomly assigning 14 schools to a Treatment condition and 15 schools to a Control condition. Treatment schools implemented the standards-based grading program, called PARLO, in their ninth-grade algebra and geometry classrooms, and Control schools proceeded with business-as-usual. In our participating districts, instruction to learning standards and implementation of formative assessment were already commonly in use. Consequently, the PARLO program focused on implementing two necessary components of standards-based grading. The first was Mastery: students were rated as not-yet-proficient, proficient, or high-performance on each learning outcome, and final grades were computed using a formula based on the number of proficient and the number of high-performance learning outcomes. The second was Reassessment: after providing evidence that they had done further studying, any student could be reassessed for full credit on any learning outcome.

Read More

Item Response Theory Models for Difference-in-Difference Estimates (and Whether They Are Worth the Trouble)

When randomized control trials are not possible, quasi-experimental methods like Regression Discontinuity and Difference-in-Difference (DiD) often represent the best alternatives for high quality evaluation. Researchers using such methods frequently conduct exhaustive robustness checks to make sure the assumptions of the model are met, and that results aren’t sensitive to specific choices made in the analysis process. However, often there is less thought applied to how the outcomes for many quasi-experimental studies are created. For example, in studies that rely on survey data, scores may be created by adding up the item responses to produce total scores, or achievement tests may rely on scores produced by test vendors. In this study, several item response theory (IRT) models specific to the DiD design are presented to see if they improve on simpler scoring approaches in terms of the bias and statistical significance of impact estimates.

Why might using a simple scoring approach do harm in the quasi-experimental/DiD context?

While most researchers are aware that measurement error can impact the precision of treatment effect estimates, they may be less aware that measurement model misspecification can introduce bias into scores and, thereby, treatment effect estimates. Total/sum scores do not technically involve a measurement model, and therefore may seem almost free of assumptions. But in fact, they resemble a constrained measurement model that oftentimes makes unsupported assumptions, including that all items should be given the same weight when producing a score. For instance, on a depression survey, total scores would assume that items asking about trouble sleeping and self-harm should get the same weight in the score. Giving all items the same weight can bias scores. For example, if patterns of responses differ between treated and control groups, faulty total score assumptions could bias treatment effect estimates and mute variability in the outcome researchers wish to quantify.
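The equal-weighting assumption can be illustrated with a small sketch. The survey, the two respondents, and the weights below are entirely hypothetical; they are not estimates from any measurement model and only show how a total score can hide differences that a weighted score would reveal.

```python
def weighted_score(responses, weights=None):
    """Score a response vector as a weighted sum of item responses.

    With weights=None every item gets weight 1, which reproduces the
    ordinary total/sum score.
    """
    if weights is None:
        weights = [1.0] * len(responses)
    if len(weights) != len(responses):
        raise ValueError("need exactly one weight per item")
    return sum(w * r for w, r in zip(weights, responses))

# Hypothetical 4-item depression survey, each item rated 0-3.
# Items 1-3 ask about sleep problems; item 4 asks about self-harm.
person_a = [3, 3, 3, 0]
person_b = [2, 2, 2, 3]

# Made-up weights that treat the self-harm item as more informative.
illustrative_weights = [1.0, 1.0, 1.0, 3.0]

print(weighted_score(person_a), weighted_score(person_b))
print(weighted_score(person_a, illustrative_weights),
      weighted_score(person_b, illustrative_weights))
```

Under equal weights the two respondents receive the same total, even though their response patterns differ. Under the illustrative weights they are separated. If response patterns differ systematically between treated and control groups, this kind of masking is one route by which total scores could distort a treatment effect estimate.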

What decisions involved in more sophisticated scoring approaches impact treatment estimates?

Read More

The AIC and aBIC Work Best For Identifying the Correct Number of Profiles in Latent Transition Analysis Applied to Typical Educational Settings

Peter A. Edelsbrunner, Maja Flaig, Michael Schneider

PDF Version

How can we best tell how many different learning patterns there are in our data?

Latent transition analysis is used to describe different learner patterns. However, it is often hard to tell how many patterns there are. Is there a pattern of learners who have little knowledge, another pattern of learners with a specific misconception, and another pattern of learners who have properly understood everything that we tried to teach them? Or are there some of these patterns but not all, or even additional ones? This is really hard to tell, and different indicators (called “relative fit indices”) are available for helping us determine how many patterns there really are. We compare the performance of several relative fit indices. We find that the Bayesian information criterion (BIC), which is commonly used to determine the number of learning patterns, is not very accurate in finding the right number of patterns in comparison to other indices.
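For readers unfamiliar with these indices, the sketch below computes the AIC, the BIC, and the sample-size-adjusted BIC from a model's log-likelihood using their standard textbook definitions. It assumes that "aBIC" refers to the sample-size-adjusted BIC. The log-likelihoods, parameter counts, and sample size are invented for illustration and are not results from the paper.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Return AIC, BIC, and sample-size-adjusted BIC (lower is better)."""
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * math.log(n_obs),
        # The adjusted BIC replaces n with (n + 2) / 24 in the penalty term.
        "aBIC": deviance + n_params * math.log((n_obs + 2) / 24),
    }

# Hypothetical fits: number of profiles -> (log-likelihood, free parameters).
fits = {2: (-2100.0, 10), 3: (-2084.0, 16), 4: (-2081.0, 22)}
n_obs = 300  # made-up sample size

results = {k: information_criteria(ll, p, n_obs) for k, (ll, p) in fits.items()}
for index in ("AIC", "BIC", "aBIC"):
    best = min(results, key=lambda k: results[k][index])
    print(index, "prefers", best, "profiles")
```

With these invented numbers the BIC's heavier penalty leads it to prefer the two-profile model, while the AIC and aBIC prefer three profiles. The example only shows how the penalties differ; it does not demonstrate which index is correct in any real data set.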

Read More

Design and Analytic Features for Reducing Biases in Skill-Building Intervention Impact Forecasts

Daniela Alvarez-Vargas, Sirui Wan, Lynn S. Fuchs, Alice Klein, & Drew H. Bailey

PDF Version

Despite their policy relevance, long-term evaluations of educational interventions are rare relative to the number of end-of-treatment evaluations. A common approach to this problem is to use statistical models to forecast the long-term effects of an intervention based on the estimated shorter-term effects. Such forecasts typically rely on the correlation between children’s early skills (e.g., preschool numeracy) and medium-term outcomes (e.g., 1st grade math achievement), calculated from longitudinal data available outside the evaluation. This approach sometimes over- or under-predicts the longer-term effects of early academic interventions, raising concerns about how best to forecast the long-term effects of such interventions. The present paper provides a methodological approach to assessing the types of research design and analysis specifications that may reduce biases in such forecasts.
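The simplest version of such a forecast can be sketched as a single multiplication: the end-of-treatment effect on the early skill, scaled by the association between that skill and the later outcome. The function name and the numbers below are hypothetical, and the sketch assumes both quantities are expressed in standard deviation units; the specifications examined in the paper may differ.

```python
def forecast_long_term_effect(short_term_effect_sd, skill_outcome_association):
    """Naive forecast of a later effect from an end-of-treatment effect.

    short_term_effect_sd: estimated impact on the early skill, in SD units.
    skill_outcome_association: standardized association between the early
        skill and the later outcome, taken from external longitudinal data.
    """
    return short_term_effect_sd * skill_outcome_association

# Hypothetical inputs: a 0.30 SD end-of-treatment effect on preschool numeracy
# and a 0.50 standardized association with 1st grade math achievement.
forecast = forecast_long_term_effect(0.30, 0.50)
print(round(forecast, 2))
```

The forecast is only as good as the association plugged into it. If that association reflects stable differences between children rather than the causal effect of the early skill, the forecast will tend to be too large, which is the kind of bias that design and analytic choices are meant to reduce.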

What did we do?

Read More

A Framework for addressing Instrumentation Biases when using Observation Systems as Outcome Measures in Instructional Interventions

Mark White, Bridget Maher, Brian Rowan

PDF Version

Many educational interventions seek to directly shift instructional practice. Observation systems are used to measure changes in instructional practice resulting from such interventions. However, the complexity of observation systems creates the risk of instrumentation biases. Instrumentation bias is bias resulting from changes to the ways that an instrument functions across conditions (e.g., from pre-test to post-test or between control and intervention conditions). For example, teachers could intentionally show off intervention-specific practices whenever they are observed, but not otherwise use those practices. Alternatively, an instructional intervention could shift instruction in ways that increase observation scores without impacting the underlying instructional dynamics that support student learning.

This conceptual paper with a case study exemplar provides a validity framework for using observation systems to evaluate the impact of interventions. Inferences about an intervention’s impact generally involve determining whether a teaching practice has changed within some setting. Observation scores, the evidence for these conclusions, are specific raters’ views of how a rubric would describe observed lessons. The conclusions are far more generalized than the observation scores. The framework (see Figure below) systematically breaks down the processes necessary to operationalize an aspect of teaching practice and sample from a setting to obtain observation scores that can be generalized to draw conclusions.

Read More

How to measure quality of delivery: Focus on teaching practices that help students to develop proximal outcomes

Diego Catalán Molina, Tenelle Porter, Catherine Oberle, Misha Haghighat, Afiya Fredericks, Kristen Budd, Sylvia Roberts, Lisa Blackwell, and Kali H. Trzesniewski

PDF Version

How much students benefit from a school intervention depends on how well the intervention is delivered

When a new curriculum is introduced at a school, the quality of its implementation will vary across teachers. Does this matter? In this study, teachers varied widely in how well they implemented a 20-lesson social and emotional blended-learning curriculum. Teachers who delivered the program at higher quality, for example, encouraged student reflection and participation and provided feedback to students on how to improve skills. Teachers who delivered the program at higher quality had students with higher levels of motivation (growth mindset, effort beliefs, and learning goals) at the end of the program compared to teachers who delivered at lower quality.

Read More

Modeling and Comparing Seasonal Trends in Interim Achievement Data

James Soland & Yeow Meng Thum

PDF Version


Interim achievement tests are often used to monitor student and school performance over time. Unlike end-of-year achievement tests used for accountability, interim tests are administered multiple times per year (e.g., Fall, Winter, and Spring) and vary across schools in terms of when in the school year students take them. As a result, scores reflect seasonal patterns in achievement, including summer learning loss. Despite the prevalence of interim tests, few statistical models are designed to answer questions commonly asked with interim test data (e.g., Do students whose achievement grows the most over several years tend to experience below-average summer loss?). In this study, we compare the properties of three growth models that can be used to examine interim test data.

Read More

Performance Evaluations as a Measure of Teacher Effectiveness When Implementation Differs

James Cowan, Dan Goldhaber, Roddy Theobald

PDF Version


We use statewide data from Massachusetts to investigate the school role in teacher evaluation. Schools classify most teachers as proficient but differ substantially in how frequently they assign other ratings. We show these patterns are driven by differences in the application of standards across schools, not by differences in the distribution of teacher quality.

Read More

Gather-Narrow-Extract: A Framework for Studying Local Policy Variation Using Web-Scraping and Natural Language Processing

Kylie L. Anglin

PDF Version

Many education policy decisions are made at the local level. School districts make policies regarding hiring, resource allocation, and day-to-day operations. However, collecting data on local policy decisions has traditionally been expensive and time-consuming, sometimes leading researchers to leave important research questions unanswered.

This paper presents a framework for efficiently identifying and processing local policy documents posted online – documents like staff manuals, union contracts, and school improvement plans – using web-scraping and natural language processing.

Read More

Mitigating Illusory Results through Preregistration in Education

Summary by: Claire Chuter

PDF Version

Good researchers thoroughly analyze their data, right? Practices like testing the right covariates, running your analyses in multiple ways to find the best fitting model, screening for outliers, and testing for mediation or moderation effects are indeed important practices… but with a massive caveat. The aggregation of many of these rigorous research practices (as well as some more dubious ones) can lead to what the authors call “illusory results” – results that seem real but are unlikely to be reproduced. In other words, implementing these common practices (see Figure 1 in the article) often leads researchers to run multiple analytic tests, which may unwittingly inflate their chances of stumbling upon a significant finding by chance.
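A short calculation shows how quickly this inflation grows. The sketch below assumes the tests are independent and each uses the same significance level, which is a simplification: real analytic choices on one data set are usually correlated, so the exact numbers will differ.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n independent tests,
    each run at significance level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n_tests in (1, 5, 10, 20):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests:>2} tests: {rate:.2f}")
```

Under these assumptions, ten tests at the 0.05 level already give roughly a 40% chance of at least one spurious significant result, and twenty tests give roughly 64%.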

Potential Solutions

Read More

The Methodological Challenges of Measuring Institutional Value-added in Higher Education

Tatiana Melguizo, Gema Zamarro, Tatiana Velasco, and Fabio J. Sanchez

PDF Version

Assessing the quality of higher education is hard, but there is growing pressure for governments to create a ranking system for institutions that can be used for assessment and funding allocations. Such a system, however, would require a reliable methodology to fairly assess colleges using a wide variety of indicators. Countries with centralized governance structures have motivated researchers to develop “value-added” metrics of colleges’ contributions to student outcomes that can be used for summative assessment (Coates, 2009; Melguizo & Wainer, 2016; Shavelson et al., 2016). Estimating the “value-added” of colleges and programs, however, is methodologically challenging. First, high- and low-achieving students tend to self-select into different colleges, a behavior that, if not accounted for, may yield estimates that capture students’ prior achievement rather than colleges’ effectiveness at raising achievement. Second, measures that use gains in student learning outcomes (SLOs) as indicators at the higher education level are scant. In our paper, we study these challenges and compare the methods used for obtaining value-added metrics in the context of higher education in Colombia.

How to best estimate value-added models in higher education?

Read More

Latent Profiles of Reading and Language and Their Association with Standardized Reading Outcomes in K-10th Grade

Barbara R Foorman, Yaacov Petscher, Christopher Stanley, & Adrea Truckenmiller

PDF Version

Differentiated instruction involves tailoring instruction to individual students’ learning needs. While critical to effective teaching, an understudied first step in differentiated instruction is understanding students’ learning profiles – that is, their strengths and weaknesses in knowledge and skills. It is only after a student’s learning profile is understood that a teacher can individualize instruction. But how can educators best measure learning profiles to facilitate differentiated instruction?

Descriptive approaches such as informal reading inventories lack the psychometric rigor required for purposes of classification, placement, and monitoring growth. However, quantitative approaches that classify and cluster (i.e., group) students by skill classes, and that validate the clusters by relating them to standardized tests, are a reliable tool for creating profiles. The objective of this study was twofold: first, to determine the profiles of reading and language skills that characterized 7,752 students in kindergarten through 10th grade; second, to relate the profiles to standardized reading outcomes.

Read More