Conference Agenda

Overview and details of the sessions of this conference.

Please note that all times are shown in the time zone of the conference.

 
 
Session Overview
Session
09 SES 03 B: Challenges in Educational Measurement Practices
Time:
Tuesday, 27/Aug/2024:
17:15 - 18:45

Session Chair: Elena Papanastasiou
Location: Room 012 in ΧΩΔ 02 (Common Teaching Facilities [CTF02]) [Ground Floor]

Cap: 56

Paper Session

Presentations
09. Assessment, Evaluation, Testing and Measurement
Paper

A Peculiarity in Educational Measurement Practices

Mark White

University of Oslo, Norway

Presenting Author: White, Mark

This paper discusses a peculiarity in institutionalized educational measurement practices: namely, an inherent contradiction between the guidelines for how scales/tests are developed and the way those scales/tests are typically analyzed.

Standard guidelines for developing scales/tests emphasize the need to identify the intended construct and select items to capture the construct’s full breadth, leading items (or subsets of items) to target different aspects of the construct. This occurs in test development through specifying the test’s content domain along with a blueprint allocating items to content domains, item formats, and/or cognitive demand levels (AERA, APA, & NCME, 2014, ch. 4). Similarly, scale development guidelines emphasize identifying sub-facets of constructs, such that items can be targeted to capture each sub-facet, ensuring that the full construct is measured (e.g., Gehlbach & Brinkworth, 2011; Steger et al., 2022). These guidelines intentionally ensure that items (or subsets of items) contain construct-relevant variation that is not contained in every other item (e.g., it is recommended to include geometry-related items when measuring math ability because such items capture construct-relevant variation in math ability that is not present in, e.g., algebra-related items; cf. Stadler et al., 2021).

At the same time, scales/tests are typically analyzed with reflective measurement models (Fried, 2020). I focus on factor models for simplicity, but the same basic point applies to item-response theory models, as a reparameterization of item-response theory models to non-linear factor models would show (McDonald, 2013). In the unidimensional factor model, the item score X_ip is modelled as X_ip = (alpha_i + lambda_i*F_p) + e_ip, where i indexes items, p indexes persons, alpha_i is an item intercept, lambda_i is a factor loading, F_p is the latent factor (the construct), and e_ip is the item- and person-specific error. The (alpha_i + lambda_i*F_p) term can be understood as an item-specific linear rescaling of the latent factor (which is on an arbitrary scale) to the item’s scale, just as one might rescale a test to obtain more interpretable scores. The factor model, then, consists of two parts: the rescaled factor and the error term. Since each item is defined as containing a rescaling of the factor, and this is the only construct-relevant variation contained in items, each item must contain all construct-related variation (i.e., all changes in the construct are reflected in each item). Note that these points are conceptual, stemming from the mathematics of the factor model, not claims about the results of fitting models to specific data.
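To make this implication concrete, here is a minimal sketch (not from the paper) that simulates data from the unidimensional factor model above; the item intercepts, loadings, error variance, and sample size are illustrative assumptions.

```python
# Minimal sketch of the unidimensional reflective factor model
# X_ip = alpha_i + lambda_i * F_p + e_ip (illustrative values only).
import numpy as np

rng = np.random.default_rng(seed=1)

n_persons, n_items = 1_000, 5
alpha = np.array([0.0, 0.2, -0.1, 0.5, 0.3])   # item intercepts alpha_i (assumed)
lam = np.array([0.8, 0.7, 0.9, 0.6, 0.75])     # factor loadings lambda_i (assumed)

F = rng.normal(0.0, 1.0, size=n_persons)                 # latent factor F_p
E = rng.normal(0.0, 0.5, size=(n_persons, n_items))      # errors e_ip

# Each item is an item-specific linear rescaling of the same factor plus error.
X = alpha + lam * F[:, None] + E

# Consequence of the model: every item tracks the factor (up to error), so no
# item carries construct-relevant variation that the other items lack.
print(np.corrcoef(np.column_stack([F, X]), rowvar=False)[0, 1:].round(2))
```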

There is a contradiction here: scales/tests are intentionally designed so that each item (or subset of items) captures unique, construct-related variation, but analyses are conducted under the assumption that no item (or subset of items) contains unique, construct-related variation. To have such a clear contradiction baked into the institutionalized practices of measurement in the educational and social sciences is peculiar indeed.


Methodology, Methods, Research Instruments or Sources Used
This is a discussion paper, so there are no true methods per se. The analyses are based on careful study of institutionalized guidelines for how to construct tests and survey scales and of the typical approaches for analyzing data from tests and survey scales. The presentation will focus on reviewing direct quotes from these guidelines in order to build the case that there is a contradiction baked into current “best practices” for measurement in the educational sciences. I will then present a logical analysis of the implications of this contradiction. Drawing on past and recent critiques of reflective modelling, I will propose that this contradiction persists because reflective models provide a clear and direct set of steps to support a set of epistemological claims about measuring the intended construct reliably and invariantly. I will then argue that, given the contradiction, these epistemological claims are not strongly supported through appeal to reflective modelling approaches. Rather, this contradiction leads to breakdowns in scientific practice (White & Stovner, 2023).
Conclusions, Expected Outcomes or Findings
The reflective measurement models that are used to evaluate the quality of educational measurement are built on a set of assumptions that contradict those used to build tests and scales. This peculiarity leaves the field evaluating the quality of measurement using models that, by design, do not fit the data to which they are applied. This raises important questions about the accuracy of claims that one has measured a specific construct, that measurement is reliable, and/or that measurement is or is not invariant. There is a need for measurement practices to shift to create alignment between the ways that tests/scales are created and the ways they are analyzed. I will discuss new modelling approaches that would facilitate this alignment (e.g., Henseler et al., 2014; Schuberth, 2021). However, questions of construct validity, reliability, and invariant measurement become more difficult when moving away from the reflective measurement paradigm.
References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. American Educational Research Association. http://www.apa.org/science/programs/testing/standards.aspx
Fried, E. I. (2020). Theories and Models: What They Are, What They Are for, and What They Are About. Psychological Inquiry, 31(4), 336–344. https://doi.org/10.1080/1047840X.2020.1854011
Gehlbach, H., & Brinkworth, M. E. (2011). Measure Twice, Cut down Error: A Process for Enhancing the Validity of Survey Scales. Review of General Psychology, 15(4), 380–387. https://doi.org/10.1037/a0025704
Henseler, J., Dijkstra, T. K., Sarstedt, M., Ringle, C. M., Diamantopoulos, A., Straub, D. W., Ketchen, D. J., Hair, J. F., Hult, G. T. M., & Calantone, R. J. (2014). Common Beliefs and Reality About PLS: Comments on Rönkkö and Evermann (2013). Organizational Research Methods, 17(2), 182–209. https://doi.org/10.1177/1094428114526928
Maraun, M. D. (1996). The Claims of Factor Analysis. Multivariate Behavioral Research, 31(4), 673–689. https://doi.org/10.1207/s15327906mbr3104_20
McDonald, R. P. (2013). Test Theory: A Unified Treatment. Psychology Press.
Schuberth, F. (2021). The Henseler-Ogasawara specification of composites in structural equation modeling: A tutorial. Psychological Methods, 28(4), 843–859. https://doi.org/10.1037/met0000432
Stadler, M., Sailer, M., & Fischer, F. (2021). Knowledge as a formative construct: A good alpha is not always better. New Ideas in Psychology, 60, 1-14. https://doi.org/hqcg
Steger, D., Jankowsky, K., Schroeders, U., & Wilhelm, O. (2022). The Road to Hell Is Paved With Good Intentions: How Common Practices in Scale Construction Hurt Validity. Assessment, 1-14. https://doi.org/10.1177/10731911221124846
White, M., & Stovner, R. B. (2023). Breakdowns in Scientific Practices: How and Why Practices Can Lead to Less than Rational Conclusions (and Proposed Solutions). OSF Preprints. https://doi.org/10.31219/osf.io/w7e8q


09. Assessment, Evaluation, Testing and Measurement
Paper

Exploring Mode-of-Delivery Effect in Reading Achievement in Sweden: A study using PIRLS 2021 data

Elpis Grammatikopoulou, Stefan Johansson, Monica Rosén

Göteborgs Universitet, Sweden

Presenting Author: Grammatikopoulou, Elpis

Reading literacy is considered an essential factor for learning and personal development (Mullis & Martin, 2015). International assessments such as PIRLS track trends and shape literacy policies; they seek to evaluate student learning globally, offering crucial insights into educational performance that inform policy decisions. Given ongoing technological expansion and innovation, a shift in delivery mode was an inevitable progression (Jerrim et al., 2018). PIRLS has adapted to these changes, introducing a digital format in 2016 (ePIRLS) and reaching a significant milestone in 2021 with a partial transition to digital assessment through a web-based delivery system. Digital PIRLS included a variety of reading texts, presented in an engaging and visually attractive format, that were designed to motivate students to read and interact with the texts and answer comprehension questions. While considerable effort has been invested to ensure content similarity between the two formats, variations persist due to the distinct modes of administration (Almaskut et al., 2023). This creates the need for further analysis to better understand the impact of these differences on the outcomes and effectiveness of the administered modes.

Previous research has highlighted the presence of a mode effect, varying in magnitude, when comparing paper-based and digital assessments (Jerrim et al., 2018; Kingston, 2009). Jerrim et al.'s (2018) analysis of PISA 2015 field trial data from Germany, Ireland, and Sweden indicates a consistent trend of students scoring lower on digital assessments than their counterparts assessed on paper. Furthermore, Kingston's (2009) meta-analysis indicates that, on average, elementary students score higher on paper, with small effect sizes for the transition from paper-based to digital reading assessments. On the other hand, PIRLS 2016 was administered both on paper and digitally in 14 countries; students in nine countries performed better on the digital assessment, while in only five countries did students perform better on paper (Grammatikopoulou et al., 2024).

Additionally, research underscores the distinct consequences of printed and digital text for memory, concentration, and comprehension (Delgado et al., 2018; Baron, 2021). Furthermore, previous findings indicate that the factors influencing performance vary between the two modes: time spent on the internet and computer use for school were significant predictors of performance on digital assessments, but not on paper-based ones (Gilleece & Eivers, 2018).

The present study

Sweden was among the 26 countries out of 57 that administered the digital format in PIRLS 2021. A paper-based test, replicated from PIRLS 2016, was also administered to a ‘bridge’ sample. To maintain consistency across formats, digital PIRLS and paper PIRLS share identical content in terms of reading passages and questions. However, digital PIRLS utilizes certain features and item types that are not available in the traditional paper-and-pencil mode. The digital version offers advantages such as operational efficiency and enhanced features while maintaining content consistency with the paper format. The primary aim of the present study is to investigate a potential mode effect between the digital and paper formats and to explore any variations in reading achievement between them. Despite advancements in digital assessment, there remains a gap in our understanding of how the shift from traditional paper-based assessments to digital formats may affect reading literacy outcomes. By delving into these potential differences, we aim to contribute valuable insights into the evolving landscape of educational assessments, informing educators, policymakers, and researchers about the effectiveness and potential challenges associated with the integration of digital modes in literacy evaluation.


Methodology, Methods, Research Instruments or Sources Used
The present study uses PIRLS 2021 data for Sweden. Sweden participated in digital PIRLS 2021 with 5,175 students; a separate, equivalent bridge sample of 1,863 students was administered the assessment on paper (Almaskut et al., 2023).
The study aims to explore the potential mode effect in paper-based and digital assessments, utilising item data from digital PIRLS and paper PIRLS. To assess and compare digital PIRLS and paper PIRLS as measures, we will employ a bifactor structural equation model with a general reading achievement factor and specific factors representing the digital and paper formats. Constructing the bifactor model involves specifying components that capture the nuances of reading achievement in both modes: a general reading achievement factor is introduced alongside specific factors representing the unique aspects of the digital and paper assessment modes. Notably, PIRLS categorizes reading into two broad purposes: reading for literary experience and reading to acquire and use information. Building upon this categorization, we will construct two variables based on these stated purposes of reading: the literary and the informational. We will explore how these variables contribute to reading achievement and whether there are variations in reading achievement between the digital and paper formats. The model will incorporate paths from ‘Literary’ and ‘Information’ to both the general factor and the specific factors; these paths facilitate the examination of how each observed variable influences overall reading achievement and its specific manifestations in the digital and paper contexts. Observed indicators for each variable are included, ensuring a comprehensive representation of the constructs in the bifactor model. Furthermore, the analysis will control for socio-economic status (SES), immigrant background, and gender while exploring mode effects or bias in either mode.
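As a rough illustration of the intended structure, the sketch below simulates item responses driven by a general reading factor plus orthogonal mode-specific factors; the item counts, loadings, and noise level are hypothetical assumptions rather than the study's actual specification, and a bifactor SEM would then be fit to data of this form in a dedicated SEM package.

```python
# Illustrative bifactor data structure: every item loads on a general reading
# factor G; digital items additionally load on S_dig, paper items on S_pap.
import numpy as np

rng = np.random.default_rng(seed=2024)
n = 2_000

G = rng.normal(size=n)       # general reading achievement factor
S_dig = rng.normal(size=n)   # specific factor: digital mode
S_pap = rng.normal(size=n)   # specific factor: paper mode

def mode_items(general, specific, lam_g, lam_s, n_items, noise=0.6):
    """Items load on the general factor and on exactly one mode-specific factor."""
    e = rng.normal(0.0, noise, size=(len(general), n_items))
    return lam_g * general[:, None] + lam_s * specific[:, None] + e

digital_items = mode_items(G, S_dig, lam_g=0.7, lam_s=0.4, n_items=6)
paper_items = mode_items(G, S_pap, lam_g=0.7, lam_s=0.4, n_items=6)
data = np.column_stack([digital_items, paper_items])  # (2000, 12) item matrix

# Within-mode item correlations exceed cross-mode ones; this extra shared
# variance is what the specific digital/paper factors absorb in a bifactor SEM.
corr = np.corrcoef(data, rowvar=False)
within = corr[:6, :6][np.triu_indices(6, k=1)].mean()
across = corr[:6, 6:].mean()
print(f"mean within-mode r: {within:.2f}, mean cross-mode r: {across:.2f}")
```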


Conclusions, Expected Outcomes or Findings
The study will employ a bifactor model in the context of PIRLS 2021 data for Sweden to elucidate the multifaceted construct of reading literacy/achievement and potential mode effects between digital and paper formats. While the empirical results are pending, we anticipate several key outcomes. We expect to observe variations in the relationships between our latent constructs and observed indicators based on the mode of assessment.
Based on previous findings, we tentatively expect to discern the presence of both general and specific factors, indicating that there are unique aspects associated with digital and paper reading processes that significantly impact reading achievement beyond the shared aspects captured by the general factor. Our expectation is grounded in the understanding that different areas and processes of reading may exhibit varied patterns. For instance, we speculate that while informational reading might predominantly contribute to the general reading achievement factor, fictional or longer-text reading may exhibit specific factors. This differentiation in our analysis aims to provide a more nuanced understanding of the complex relationships within the reading achievement construct, considering the diverse aspects of reading activities and processes associated with digital and paper formats. The complexities revealed in our analyses may prompt inquiries into additional contextual factors, the stability of mode effects across different populations, and the longitudinal impact on reading outcomes. In conclusion, our study's expected outcomes encompass a comprehensive exploration of mode effects, the unique contributions of latent factors, the significance of specific indicators, implications for educational practice, and the identification of future research directions.

References
Almaskut, A., LaRoche, S., & Foy, P. (2023). Sample Design in PIRLS 2021. TIMSS & PIRLS International Study Center. https://doi.org/10.6017/lse.tpisc.tr2103.kb9560
Baron, N. S. (2021). Know what? How digital technologies undermine learning and remembering. Journal of Pragmatics, 175, 27–37. https://doi.org/10.1016/j.pragma.2021.01.011
Cheung, K., Mak, S., & Sit, P. (2013). Online Reading Activities and ICT Use as Mediating Variables in Explaining the Gender Difference in Digital Reading Literacy: Comparing Hong Kong and Korea. The Asia-Pacific Education Researcher, 22(4), 709–720. https://doi.org/10.1007/s40299-013-0077-x
Cho, B.-Y., Hwang, H., & Jang, B. G. (2021). Predicting fourth grade digital reading comprehension: A secondary data analysis of (e)PIRLS 2016. International Journal of Educational Research, 105, 101696. https://doi.org/10.1016/j.ijer.2020.101696
Delgado, P., Vargas, C., Ackerman, R., & Salmerón, L. (2018). Don’t throw away your printed books: A meta-analysis on the effects of reading media on reading comprehension. Educational Research Review, 25, 23–38. https://doi.org/10.1016/j.edurev.2018.09.003
Gilleece, L., & Eivers, E. (2018). Characteristics associated with paper-based and online reading in Ireland: Findings from PIRLS and ePIRLS 2016. International Journal of Educational Research, 91, 16–27. https://doi.org/10.1016/j.ijer.2018.07.004
Grammatikopoulou, E., Johansson, S., & Rosén, M., (2024). Paper-based and Digital Reading in 14 countries: Exploring cross-country variation in mode effects. Unpublished manuscript.
Jerrim, J., Micklewright, J., Heine, J.-H., Salzer, C., & McKeown, C. (2018). PISA 2015: How big is the ‘mode effect’ and what has been done about it? Oxford Review of Education, 44(4), 476–493. https://doi.org/10.1080/03054985.2018.1430025
Kingston, N. M. (2008). Comparability of Computer- and Paper-Administered Multiple-Choice Tests for K–12 Populations: A Synthesis. Applied Measurement in Education, 22(1), 22–37. https://doi.org/10.1080/08957340802558326
Krull, J. L., & MacKinnon, D. P. (2001). Multilevel Modeling of Individual and Group Level Mediated Effects. Multivariate Behavioral Research, 36(2), 249–277. https://doi.org/10.1207/S15327906MBR3602_06
Mullis, I. V. S., & Martin, M. O. (Eds.). (2015). PIRLS 2016 Assessment Framework (2nd ed.). Boston College, TIMSS & PIRLS International Study Center. http://timssandpirls.bc.edu/pirls2016/framework.html
Rasmusson, M., & Åberg-Bengtsson, L. (2015). Does Performance in Digital Reading Relate to Computer Game Playing? A Study of Factor Structure and Gender Patterns in 15-Year-Olds’ Reading Literacy Performance. Scandinavian Journal of Educational Research, 59(6), 691–709. https://doi.org/10.1080/00313831.2014.965795


09. Assessment, Evaluation, Testing and Measurement
Paper

What’s the Effect of Person Nonresponse in PISA and ICCS?

Christian Tallberg, Daniel Gustafsson

The Swedish National Agency for Education

Presenting Author: Tallberg, Christian

International Large-Scale Assessments (ILSAs), such as PISA and ICCS, provide internationally comparative data on students' knowledge and abilities in various subjects. The results across assessments permit countries to compare their educational systems over time and in a global context. To make this possible, the implementation and the methodology on which the studies are based need to be rigorously standardized and of high quality. But even in a well-designed study, missing data almost always occur. Missing data can reduce the statistical power of a study and can produce biased estimates, leading to invalid conclusions. The mechanisms by which missing data occur are many. One such mechanism arises, for example, in studies based on low-stakes tests, which ILSAs should be considered to be. In low-stakes tests, neither the students nor their teachers receive any feedback based on the students' results. Besides risking reduced validity of comparisons, both over time and between countries, low-stakes tests run the risk of giving rise to a greater proportion of missing data.

Sweden has a long tradition of high-quality administrative population registers, and this tradition means that a great deal of data can be linked to individuals via so-called social security numbers. It is relatively common for researchers and authorities to employ these high-quality data in their analyses, yielding more reliable results. The Swedish National Agency for Education regularly uses register data when producing official statistics and, to a certain extent, also when carrying out evaluation studies.

The ILSAs are used to evaluate the condition of the Swedish school system, both by the Swedish National Agency for Education and by decision-makers and other stakeholders. To further the possibilities for secondary analyses and to increase relevance to the national context, it is therefore pertinent to collate register data with data from the ILSAs.

Historically, the Swedish National Agency for Education has only been able to link register data to ILSA data for participating students, because participating students are considered to have given their consent for such linkages. However, before PISA 2022 and ICCS 2022 were conducted, the legal requirements changed so that it became possible for the Swedish National Agency for Education to link register data also to nonresponding students, i.e., not only to participating students.


Methodology, Methods, Research Instruments or Sources Used
The Swedish samples in PISA 2022 and ICCS 2022 consist of 7,732 15-year-olds and 3,900 students in grade 8, respectively. After excluding students due to cognitive or physical impairment, or due to insufficient skills in the Swedish language, 7,203 students in PISA 2022 and 3,632 in ICCS 2022 remain. Of these, the weighted student nonresponse is 15 percent in PISA and 13 percent in ICCS. By employing register data on the full sample, such as students' final grades in primary school, migration background, and parents' level of education, we have studied how student nonresponse covaries with student background characteristics (Swedish National Agency for Education, 2023a; Swedish National Agency for Education, 2023b). Furthermore, we have carried out post-stratification-type analyses (Little & Rubin, 2020) to estimate the effect of nonresponse on students' achievement. Finally, we compared students' achievement computed with PISA's and ICCS's rather non-informative nonparticipation-adjusted weights to students' achievement computed with nonparticipation weights adjusted using register data (OECD, 2023; IEA, 2023).
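The sketch below illustrates the general idea of a post-stratification-type nonresponse adjustment with register data; the data, the response mechanism, and the register variable (parents' education) are entirely hypothetical, and the code is not the agency's or PISA's actual weighting procedure.

```python
# Hypothetical post-stratification-type adjustment: respondents' weights are
# rescaled so each register-defined stratum regains its weight share in the
# full eligible sample; achievement means are compared before and after.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 7_203  # eligible PISA 2022 sample size reported in the abstract
sample = pd.DataFrame({
    "parent_edu": rng.choice(["low", "medium", "high"], size=n, p=[0.2, 0.5, 0.3]),
    "base_weight": rng.uniform(0.5, 2.0, size=n),
})

# Illustrative nonresponse mechanism: participation depends on the register variable.
p_respond = sample["parent_edu"].map({"low": 0.75, "medium": 0.85, "high": 0.90})
sample["responded"] = rng.random(n) < p_respond

# Hypothetical achievement scores, correlated with parents' education.
sample["score"] = (
    470
    + sample["parent_edu"].map({"low": -30, "medium": 0, "high": 30})
    + rng.normal(0, 80, size=n)
)

resp = sample[sample["responded"]].copy()

# Adjustment factor per stratum: full-sample weight total / respondent weight total.
full_totals = sample.groupby("parent_edu")["base_weight"].sum()
resp_totals = resp.groupby("parent_edu")["base_weight"].sum()
resp["adj_weight"] = resp["base_weight"] * resp["parent_edu"].map(full_totals / resp_totals)

def wmean(df, w):
    return np.average(df["score"], weights=df[w])

print(f"respondents, base weights:     {wmean(resp, 'base_weight'):.1f}")
print(f"respondents, adjusted weights: {wmean(resp, 'adj_weight'):.1f}")
print(f"full eligible sample:          {np.average(sample['score'], weights=sample['base_weight']):.1f}")
```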
Conclusions, Expected Outcomes or Findings
The results indicate that student nonresponse leads to a bias that contributes to a certain overestimation of the students' average results. Hence, Sweden's results appear to be too high, given which students participated and which did not. However, the overestimation differs between the two studies: the bias appears larger in PISA than in ICCS, and whereas the bias seems to lead to a significant overestimation of students' results in PISA, the bias in ICCS seems to have a non-significant effect on students' results. Furthermore, we find that regardless of whether we study PISA or ICCS, two studies that differ methodologically in several respects but are similar in the way they compensate for person nonresponse bias, the effect of the nonresponse-compensating elements on students' achievement is negligible.
The finding that missingness leads to an overestimation of students' average results in ILSAs is consistent with previously published studies, both in relation to ILSAs (Micklewright et al., 2012; Meinck et al., 2023) and to sample surveys in general (Groves & Peytcheva, 2008; Brick & Tourangeau, 2017). However, more work is needed, as we do not know the relationship between the proportion of missingness and the size of its bias, nor whether this relationship changes over time or how it might differ in an international comparison. Furthermore, when compensating for missing data, our results call into question how reasonable it is to assume that data are missing completely at random (MCAR), an assumption commonly made in ILSAs given a sampled school or class.

References
Groves, R. M., & Peytcheva, E. (2008). The impact of nonresponse rates on nonresponse bias: A meta-analysis.
IEA. (2023). ICCS 2022 Technical Report.
Little, R. J. A., & Rubin, D. B. (2020). Statistical Analysis with Missing Data (3rd ed.).
Meinck et al. (2023). Bias risks in ILSA related to non-participation: Evidence from a longitudinal large-scale survey in Germany (PISA Plus).
Micklewright et al. (2012). Non-response biases in surveys of schoolchildren: The case of the English Programme for International Student Assessment (PISA) samples.
OECD. (2023). PISA 2022 Technical Report.
Swedish National Agency for Education. (2023a). ICCS 2022 metodbilaga [ICCS 2022 method appendix].
Swedish National Agency for Education. (2023b). PISA 2022 metodbilaga [PISA 2022 method appendix].


 