Conference Agenda

Overview and details of the sessions of this conference.

Please note that all times are shown in the time zone of the conference.

Session Overview
Session
09 SES 12 B: Reimagining Assessment Practices and Teacher Autonomy
Time:
Thursday, 29/Aug/2024:
15:45 - 17:15

Session Chair: Ana María Mejía-Rodríguez
Location: Room 012 in ΧΩΔ 02 (Common Teaching Facilities [CTF02]) [Ground Floor]

Cap: 56

Paper Session

Presentations
09. Assessment, Evaluation, Testing and Measurement
Paper

Do Teachers Prefer to Be Free? Teachers' Appreciation of Autonomy in Students' Assessment as a Personal Interpretation of Professional Reality

Orit Schwarz-Franco1, Linor L. Hadar2

1Beit Ber Academic College, Israel; 2University of Haifa

Presenting Author: Schwarz-Franco, Orit; Hadar, Linor L.

Our aim in this study was to learn about teachers' understanding and appreciation of their autonomy in the context of students' assessment. The study's context was a reform of Israel's national matriculation exams (announced in 2022), which involved transitioning from external, state-governed examinations to school-based assessment. The reform triggered discussions and re-evaluation of teachers' professional autonomy and of assessment policy. In this context we explored teachers' perceptions of the effect of assessment on professional autonomy. Furthermore, we broadened the scope of our study beyond the confines of the local reform, using this specific case to draw more general insights into how teachers attribute significance to the professional conditions within which they work and how these conditions affect their sense of autonomy. We looked at the relation between autonomy in assessment and autonomy in other aspects of teachers' work. Furthermore, we studied the role of autonomy in teachers' professional identity.

Our main research questions were: Which factors do the teachers acknowledge as contributing to their sense and preferences of autonomy? What are teachers' perceptions of the effect of assessment on their professional autonomy?

The theoretical framework of the study includes several types of literature.

First, we draw on a philosophical analysis of teachers' freedom and responsibility, based on the existential philosophy of Jean-Paul Sartre (1946/2017). Teachers' professional identity has been recognized as an extreme case of the human destiny portrayed by Sartre (Author 1, 2022). While practicing the art of teaching (Schwab, 1983), teachers have a constant need to make choices in class, interpreting the system's regulations, practicing an inevitable autonomy, and exerting professional responsibility.

Second, we reviewed current studies and found that teacher autonomy research mirrors trends in national and global education. Several studies indicate the favorable effects of teacher autonomy on teachers' perceived self-efficacy, work satisfaction, and empowerment, and on creating a positive work climate. They also show that constraints on autonomy correlate with teacher turnover and with the risk of emotional exhaustion and burnout (Skaalvik & Skaalvik, 2014).

Despite the recognition of the importance of teacher autonomy for job satisfaction (Juntunen, 2017), successful schools, and professional development (Wermke et al., 2019), there is less consensus on its definition (Pearson & Moomaw, 2005). Autonomous teachers have a high degree of control over daily practice issues (Wermke et al., 2019). Friedman's scale for teacher-work autonomy (TWA; Friedman, 1999) includes four functioning areas pertinent to teachers' sense of autonomy: class teaching, school operating, staff development, and curriculum development. In a re-evaluation of Friedman's scale (Strong & Yoshida, 2014), the number of autonomy areas grew to six and came to include assessment. In this paper we adopt Lennert Da Silva's (2022) definition, which relates autonomy to the scope of decision-making and control teachers have in relation to national educational policy.

Third, we examined studies that look at autonomy in the context of student assessment as part of the larger theme of accountability. In the context of marketization, school decentralization places school leaders within a framework of bureaucratic regulations, discourses of competitive enterprise, and external public accountability measures that are spreading worldwide (Hammersley-Fletcher et al., 2021; Verger et al., 2019).

External assessment is a central factor in accountability (Ben-Peretz, 2012). High-stakes accountability casts a shadow on teachers' professional practice (Clarke, 2012; Mausethagen & Granlund, 2012), and their everyday practice is constrained by external testing (Ball, 2003, 2008a, 2008b). Focusing on assessment as one expression of accountability, studies discuss the tension between external testing and autonomy. State-controlled assessment is viewed as a shift away from teacher professionalism towards the adoption of teaching methods that erode teacher autonomy in curriculum development and instructional decision-making (Day & Smethem, 2009).


Methodology, Methods, Research Instruments or Sources Used
Drawing on existential philosophy and empirical literature on the connection between student assessment and teacher autonomy, we adopted a qualitative approach and conducted in-depth interviews with 12 teachers selected from four diverse schools to ensure a broad representation of student populations.
For data collection, we employed a semi-structured interview format that began with general questions, giving the teachers an opportunity to freely express their perspectives on their autonomy. We aimed to ascertain whether teachers would refer to assessment processes and to the reform as aspects of autonomy and as factors in their general work experience before we asked them specifically about these topics. We asked: Do you like your work? What aspects contribute to your enjoyment in teaching? What factors disturb you or diminish your satisfaction? Do you feel free at work?
The subsequent phase of the interview centered on the matriculation reform, exploring whether teachers had perceived alterations to their level of autonomy. We used questions like: How do you usually evaluate your students? What is your opinion about the reform in the matriculation examinations? The interviews lasted one and a half hours, on average. They were conducted face to face, recorded, and later transcribed.
To analyze our data, we used inductive qualitative content analysis (Cho & Lee, 2014). We conducted open coding of the data, asking questions such as: What do the teachers' responses reveal about their views of the 'is' and the 'ought' of their professional autonomy? Do they see a difference between internal assessment (INA) and external assessment (EXA) as factors influencing their autonomy? This procedure resulted in preliminary categories.
Next, we explored the data to identify commonalities, disparities, complementarities, and interconnections among the teachers, while also considering their individual characteristics. To ensure trustworthiness, the categories obtained from this procedure were abstracted by each researcher individually. We then compared notes and agreed on the final categorization scheme.
The overarching categories addressing the two research questions relate to professional circumstances: the national education system and the school in which each teacher works. As informed by inductive data analysis methodology, the analysis process also revealed professional qualities that influence teachers' view of autonomy. These were specifically identified by the teachers in the interviews and included professional confidence and a sense of purpose. The final categorical scheme is concerned not only with the individual categories but, more significantly, with their arrangement and interplay.

Conclusions, Expected Outcomes or Findings

Overall, our analysis shows that teachers' sense and preference of autonomy, as expressed in their responses to the matriculation reform, stemmed from a personal, subjective interpretation of the objective circumstances of their professional environment.
Despite diverse attitudes, the majority of teachers expressed a preference for autonomy, especially in assessment. Given the global teacher shortage and the challenges of retaining high-quality teachers (García et al., 2022; Guthery & Bailes, 2022), recognizing that external assessments constrain teachers' experienced autonomy has significant implications for policymakers deciding on state assessments.
The teachers highlighted the significance of two elements shaping their professional experience and determining the degree of autonomy they have: the national education system and the school. They referred to assessment as a clear example of the complex interplay between these two elements; however, the teachers emphasized a holistic approach to autonomy, in which assessment cannot stand alone. For them, autonomy included curricular planning and assessment design together.
Moreover, teachers' appreciation of their autonomy is inspired by two professional qualities: confidence and a sense of purpose. This conclusion, regarding the relationship between teachers' confidence, sense of purpose, and their views about autonomy, bears important implications for teacher professional learning and development, as well as for teacher education.
We recognize the need to elaborate on this conclusion by designing ways to enhance and promote these professional qualities as part of the shaping of the professional identity of novice as well as experienced teachers.

References
Ball, S. (2008b). Performativity, privatisation, professionals and the state. In B. Cunningham (Ed.), Exploring professionalism (pp. 50–72). Institute of Education
Day, C., & Smethem, L. (2009). The effects of reform: Have teachers really lost their sense of professionalism? Journal of Educational Change, 10, 141–157.
Ben-Peretz, M. (2012). Accountability vs. teacher autonomy: An issue of balance. In The Routledge international handbook of teacher and school development (pp. 83-92). Routledge.
Cho, J. Y., & Lee, E. H. (2014). Reducing confusion about grounded theory and qualitative content analysis: Similarities and differences. Qualitative report, 19(32), 1-20.

Friedman, I. A. (1999). Teacher-perceived work autonomy: The concept and its measurement. Educational and Psychological Measurement, 59(1), 58-76.

García, E., Han, E., & Weiss, E. (2022). Determinants of teacher attrition: Evidence from district-teacher matched data. Education Policy Analysis Archives, 30(25).
Guthery, S., & Bailes, L. P. (2022). Building experience and retention: the influence of principal tenure on teacher retention rates. Journal of Educational Administration, 60(4), 439-455.

Hammersley-Fletcher, L., Kılıçoğlu, D., & Kılıçoğlu, G. (2021). Does autonomy exist? Comparing the autonomy of teachers and senior leaders in England and Turkey. Oxford Review of Education, 47(2), 189-206.

Juntunen, M. L. (2017). National assessment meets teacher autonomy: national assessment of learning outcomes in music in Finnish basic education. Music Education Research, 19(1), 1-16.
Lennert Da Silva, A. L. (2022). Comparing teacher autonomy in different models of educational governance. Nordic Journal of Studies in Educational Policy, 8(2), 103-118.
Pearson, L. C., & Moomaw, W. (2005). The relationship between teacher autonomy and stress, work satisfaction, empowerment, and professionalism. Educational Research Quarterly, 29(1), 38-54.

Sartre, J. P. (1946 / 2017). Existentialism is a humanism (C. Macomber, Trans.). Yale University Press.
Schwab, J. J. (1973). The practical 3: Translation into curriculum. The School Review, 81(4), 501-522.
Schwab, J. J. (1983). The practical 4: Something for curriculum professors to do. Curriculum Inquiry, 13(3), 239-265.
Skaalvik, E. M., & Skaalvik, S. (2014). Teacher self-efficacy and perceived autonomy: Relations with teacher engagement, job satisfaction, and emotional exhaustion. Psychological Reports, 114(1), 68-77.
Wermke, W., Olason Rick, S., & Salokangas, M. (2019). Decision-making and control: Perceived autonomy of teachers in Germany and Sweden. Journal of Curriculum Studies, 51(3), 306-325.
Strong, L. E., & Yoshida, R. K. (2014). Teachers’ autonomy in today's educational climate: Current perceptions from an acceptable instrument. Educational Studies, 50(2), 123-145.


09. Assessment, Evaluation, Testing and Measurement
Paper

How Well Can AI Identify Effective Teachers?

Michael Strong1, John Gargani2, Minju Yi1, Ibrahim Akdilek1

1Texas Tech University, United States of America; 2Gargani & Co Inc

Presenting Author: Strong, Michael; Gargani, John

We report ongoing research that assesses how well AI can evaluate teaching, which we define as "effective" to the degree that it helps students learn. Our current research builds on a body of prior work in which we assessed how well human judges performed the same task. Under varying conditions (length of instructional sample; instruction documented as video, audio, and transcript; and judgments based on intuition alone, high-inference rubrics, and low-inference rubrics), human judges demonstrate significant limitations. Experts and nonexperts did no better than chance when they relied solely on their intuitive judgment. Experts fared no better when using high-inference rubrics. However, experts and nonexperts were more accurate than chance when they used low-inference rubrics, and just as accurate using transcripts of instruction as using video. Machines are very good at performing low-inference tasks, and AI in particular is very good at "understanding" written text, such as transcripts. Is AI better at judging teaching effectiveness from transcripts than humans? If so, should human judges be replaced by machines? We provide data that may help answer these questions and engage our audience in a discussion of the moral dilemmas they pose.


Methodology, Methods, Research Instruments or Sources Used
We investigate two types of evaluative judgments—unstructured and structured. Unstructured judgments were investigated by asking subjects to "use what they know" to classify classroom instruction of known quality as being of either high or low effectiveness. Structured judgments were investigated by asking subjects to count the occurrences of six concrete teaching behaviors using the RATE rubric. The performance of two groups of subjects is compared—human judges and AI. The tasks with human subjects are replications of experiments we previously conducted and published (Strong et al., 2011; Gargani & Strong, 2014, 2015). We are, therefore, able to compare the performance of AI and humans on the same tasks at the same time, as well as to human judges in previous studies. A contribution of our work concerns the difficult problem of developing prompts for AI that instruct it to complete the evaluation tasks. Our protocol is iterative—we developed and piloted prompts, revised them, piloted again, and so on until satisfied that any failure to complete a task well would not be attributable to weaknesses in the prompts. We developed our own criteria for prompts, which we will share. One hundred human subjects were recruited to act as a benchmark for the AI, and they use an online platform to complete the tasks. Comparisons of accuracy and reliability will be made across groups and tasks, providing a basis for judging the relative success of AI and human judges.
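As a rough illustration of this kind of comparison (not the authors' analysis code), the R sketch below uses a hypothetical set of judgments, one row per judged lesson with the known quality label and the verdict of a human judge or of the AI, and checks each group's classification accuracy against chance and against the other group.

# Hypothetical data (for illustration only): each row is one judged lesson,
# with the known effectiveness label and the classification given by a judge.
set.seed(1)
judgments <- data.frame(
  judge   = rep(c("ai", "human"), each = 50),
  truth   = sample(c("high", "low"), 100, replace = TRUE),
  verdict = sample(c("high", "low"), 100, replace = TRUE)
)

# Classification accuracy per judge type.
accuracy <- tapply(judgments$verdict == judgments$truth, judgments$judge, mean)
print(accuracy)

# Is each group better than chance (50%)? Exact binomial test per group.
for (g in names(accuracy)) {
  hits <- sum(judgments$verdict[judgments$judge == g] ==
                judgments$truth[judgments$judge == g])
  n <- sum(judgments$judge == g)
  print(binom.test(hits, n, p = 0.5, alternative = "greater"))
}

# Do AI and human accuracy differ? Two-sample test of proportions.
hits_by_group <- tapply(judgments$verdict == judgments$truth, judgments$judge, sum)
n_by_group <- as.vector(table(judgments$judge))
prop.test(as.vector(hits_by_group), n_by_group)

Reliability across judges (e.g., intraclass correlations in the spirit of Shrout & Fleiss, 1979) would be computed analogously once several judgments per lesson are available.
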
Conclusions, Expected Outcomes or Findings
We hypothesize that the use of lesson transcripts versus video or audio only will reduce the sources of bias such that humans will be able to more accurately distinguish between above-average and below-average teachers. We further hypothesize that AI will be more accurate than humans, and can be successfully trained to produce reliable evaluations using a formal observation system.
References
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.
Strong, M. (2011). The highly qualified teacher: What is teacher quality and how do we measure it? New York: Teachers College Press.
Strong, M., Gargani, J., & Hacifazlioğlu, Ö. (2011). Do we know a successful teacher when we see one? Experiments in the identification of effective teachers. Journal of Teacher Education, 20(10), 1-16.


09. Assessment, Evaluation, Testing and Measurement
Paper

The Evaluation of Online Content. Development and Empirical Evaluation of a Measurement Instrument for Primary School Children

Tina Jocham, Sanna Pohlmann-Rother

University of Wuerzburg, Germany

Presenting Author: Jocham, Tina

Children today are growing up in a digitally connected world which sets them apart from previous generations. For example, 42% of 5- to 7-year-olds have their own tablet and 93% of 8- to 11-year-olds spend an average of 13.5 hours online (Ofcom 2022). Digital media provides opportunities for easier access to information and communication with peers. However, it also presents a range of risks, especially for children who are particularly vulnerable due to their young age. This becomes clear when they are confronted with violent, sexual, advertising, or judgmental content in the digital space (Livingstone et al. 2015). Other challenges in digital communication and information channels include fake news, propaganda and deepfakes. With regard to the aforementioned aspects, it is necessary to possess skills that enable a critical examination of information. For this reason, information evaluation is considered an important subskill for social participation and learning inside and outside of school.

When examining the internet preferences of children and young people, it becomes apparent that they are primarily interested in extracurricular activities rather than the child-friendly services commonly discussed in school settings, such as children's search engines. The top four internet activities include WhatsApp, watching films and videos, and using YouTube and search engines (Feierabend et al., 2023). In this respect, WhatsApp, YouTube, and TikTok are the most popular (social media) platforms (Reppert-Bismarck et al., 2019). The evaluation of content is not limited to online research alone; it can also occur in other scenarios, such as browsing the internet for entertainment or out of boredom. The strategies for evaluating content vary depending on the purpose at hand (Weisberg et al., 2023), allowing information, data, and content to be assessed from different angles. One approach to evaluating content is to verify its credibility. In the research literature, credibility encompasses multiple aspects, including assessing the trustworthiness of content, such as recognizing intention or the expertise of the author. However, studies show that young people tend to lack critical evaluation skills when it comes to the credibility of online content (Kiili et al., 2018) and are also insufficiently prepared to verify the truthfulness of information (Hasebrink et al., 2019). In the context of social media in particular, the question of the realism of shared content (e.g., factuality or plausibility) arises. Recipients are faced with the challenge of multiplicity resulting from the different 'realities' on social media. These realities are shaped by different motivations, attitudes, and political or social contexts, which can blur boundaries (Cho et al., 2022). Overall, the evaluation process of online content is influenced by various factors. For instance, research suggests that reading competence affects the evaluation process. Furthermore, socioeconomic status has been found to influence the digitalization-related skills of young people (see the ICILS results). Another important aspect is the influence of platform-specific knowledge (e.g., understanding the YouTube algorithm) and topic-specific knowledge (e.g., the subject of a news video) on content evaluation. In addition, the design of both the platform and the content can have an impact, including factors such as image-to-text ratio, layout, effects, and the focus of the central message.

To what extent these assumptions apply to primary school children is unclear, as most empirical results relate to adults or adolescents. Therefore, the overarching goal of the project is to develop a standardized measurement instrument for primary school children in order to assess to what extent they are able to evaluate internet content. The creation of a standardized measurement instrument involves several substeps, which are outlined below.


Methodology, Methods, Research Instruments or Sources Used
Model
The development of a measurement instrument requires a theoretical and empirical foundation. Only a limited number of models specifically address the evaluation of online content by primary school children. Therefore, we examined constructs related to the subcompetence of 'evaluation' to develop a theoretically and empirically grounded measurement model. For this purpose, we used normatively formulated standards, theoretical models, and empirical studies that systematize, assess, or discuss information, media, digital, internet, and social media skills. The analysis of these constructs can yield various criteria for evaluating online content, such as credibility or realism. For instance, context is crucial when evaluating content (e.g., advertising content; Purington Drake et al., 2023). As most of this analysis does not relate to primary schools, all German curricula (e.g., those based on DigComp; Ferrari, 2013) were examined for relevant subcompetencies and content areas. The aim is to compare the research results with normative requirements in the primary school sector to ensure that competence targets are not set unrealistically high.

Assessment instrument
Based on the measurement model, we developed a digital performance test with 20 multiple-choice tasks. To increase content validity, the instrument includes multimodal test items from the age group's most popular platforms (e.g., YouTube). The operationalization includes phenomena that are platform-specific (e.g., clickbait). Assessment criteria were derived for each content area and subcompetency and adapted to the specific platform content, such as a promotional video with child influencers. Expert interviews in the online children's sector additionally contributed to the development of age-appropriate content and evaluation criteria (Brückner et al. 2020).

Validation steps/procedures
To validate the 20 test items, a qualitative comprehensibility analysis was conducted in small group discussions with school and university experts (n=12). Subsequently, five children were observed using the think-aloud method while they solved the test items (Brandt & Moosbrugger, 2020). Both validation steps led to linguistic and content-related adjustments.

Pilot study
An initial test of the measurement instrument was conducted with 81 pupils (56.8% female) in Grade 3/4 (M=10.4, SD=0.64). 57 children were given parental permission to provide information on their socioeconomic status (HISEI=47.44, SD=16.42). 51.9% predominantly speak another language at home. The aim of the pilot study was to perform an initial descriptive item analysis to determine task difficulty, variance, and selectivity. The calculation of an overall score requires item homogeneity, wherein high selectivity indices serve as an initial indication (Kelava and Moosbrugger 2020).
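For readers less familiar with this kind of descriptive item analysis, the R sketch below illustrates the computation of item difficulty, item variance, and selectivity (corrected item-total correlation) on a hypothetical 0/1 response matrix; it is an illustration under assumed data, not the project's analysis script.

# Hypothetical scored responses (for illustration): 81 pupils x 20 items,
# 1 = item solved correctly, 0 = not solved.
set.seed(1)
responses <- matrix(rbinom(81 * 20, 1, prob = 0.6), nrow = 81,
                    dimnames = list(NULL, paste0("item", 1:20)))

# Item difficulty P_i: percentage of pupils who solved each item.
difficulty <- round(100 * colMeans(responses), 1)

# Item variance.
item_variance <- round(apply(responses, 2, var), 3)

# Selectivity r_it: corrected item-total correlation
# (each item correlated with the total score of the remaining items).
total <- rowSums(responses)
selectivity <- sapply(seq_len(ncol(responses)), function(i) {
  cor(responses[, i], total - responses[, i])
})

data.frame(item = colnames(responses), P_i = difficulty,
           variance = item_variance, r_it = round(selectivity, 2))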

Conclusions, Expected Outcomes or Findings
The results of the piloting showed that 15 of the 20 test items had a task difficulty of 45≤Pi≤78. Five items were more difficult (25≤Pi≤39); these items primarily dealt with phishing, clickbait, the use of third-party data, and bots. The calculation of the correlational relationships showed an inconsistent picture across tasks, which resulted in low selectivity indices (rit<.3) in some cases. Due to the small sample size, it was not possible to determine definitively whether the data had a unidimensional or multidimensional structure (principal component analysis with varimax rotation). As a result, the selectivity was not interpreted further (Kelava & Moosbrugger, 2020).
It is not surprising that students struggled with test tasks involving deception and personality interference, as even adults find phenomena like bots to be challenging (Wineburg et al. 2019). This raises the question of whether this content is appropriate for primary schools despite its real-world relevance. Methodological challenges in investigating such phenomena and implications for school support are discussed in the main study.
As a result of the pilot study, the five most challenging tasks were adjusted in terms of difficulty without altering the core content (e.g., linguistic adaptations of questions/answers, replacement of videos). To obtain precise information on unidimensionality, IRT models were utilized for data analysis in the main study (Kelava & Moosbrugger, 2020). Data collection for the main study was completed in December 2023 (n=672) and is intended to provide more precise insights into item and test quality. The quality results of the measurement instrument will be reported at the conference, with a focus on the area of deception. The study asks whether primary school children are able to evaluate deceptive content and what methodological challenges this poses for measurement. It will also investigate whether individual variables (socioeconomic status, migration history) influence the evaluation of deceptive content.

References
Brandt, H., & Moosbrugger, H. (2020). Planungsaspekte und Konstruktionsphasen von Tests und Fragebogen. In H. Moosbrugger & A. Kelava (Eds.), Testtheorie und Fragebogenkonstruktion (pp. 41-66). Berlin, Heidelberg: Springer.
Brückner, S., Zlatkin-Troitschanskaia, O., & Pant, H. A. (2020). Standards für pädagogisches Testen. In H. Moosbrugger & A. Kelava (Eds.), Testtheorie und Fragebogenkonstruktion (pp. 217-248). Berlin, Heidelberg: Springer.
Cho, H., Cannon, J., Lopez, R., & Li, W. (2022). Social media literacy: A conceptual framework. New Media & Society. DOI: 10.1177/14614448211068530
Feierabend, S., Rathgeb, T., Kheredmand, H., & Glöckler, S. (2023). KIM-Studie 2022: Kindheit, Internet, Medien. Basisuntersuchung zum Medienumgang 6- bis 13-Jähriger. Medienpädagogischer Forschungsverbund Südwest (mpfs). Available at https://www.mpfs.de/studien/kim-studie/2022/
Ferrari, A. (2013). DIGCOMP: A framework for developing and understanding digital competence in Europe. European Commission Joint Research Centre. Available at https://publications.jrc.ec.europa.eu/repository/handle/JRC83167 (accessed 16.05.2023).
Hasebrink, U., Lampert, C., & Thiel, K. (2019). Online-Erfahrungen von 9- bis 17-Jährigen: Ergebnisse der EU Kids Online-Befragung in Deutschland 2019 (2nd rev. ed.). Hamburg: Verlag Hans-Bredow.
Kelava, A., & Moosbrugger, H. (2020). Deskriptivstatistische Itemanalyse und Testwertbestimmung. In H. Moosbrugger & A. Kelava (Eds.), Testtheorie und Fragebogenkonstruktion (pp. 143-158). Berlin, Heidelberg: Springer.
Kiili, C., Leu, D. J., Utriainen, J., Coiro, J., Kanniainen, L., Tolvanen, A., et al. (2018). Reading to learn from online information: Modeling the factor structure. Journal of Literacy Research, 50(3), 304-334. DOI: 10.1177/1086296X18784640
Livingstone, S., Mascheroni, G., & Staksrud, E. (2015). Developing a framework for researching children's online risks and opportunities in Europe. EU Kids Online. Available at https://eprints.lse.ac.uk/64470/1/__lse.ac.uk_storage_LIBRARY_Secondary_libfile_shared_repository_Content_EU%20Kids%20Online_EU%20Kids%20Online_Developing%20framework%20for%20researching_2015.pdf (accessed 11.01.2024).
Ofcom (2022). Children and parents: Media use and attitudes report. Available at https://www.ofcom.org.uk/__data/assets/pdf_file/0024/234609/childrens-media-use-and-attitudes-report-2022.pdf
Purington Drake, A., Masur, P. K., Bazarova, N. N., Zou, W., & Whitlock, J. (2023). The youth social media literacy inventory: Development and validation using item response theory in the US. Journal of Children and Media, 1-21. DOI: 10.1080/17482798.2023.2230493
Reppert-Bismarck, Dombrowski, T., & Prager, T. (2019). Tackling disinformation face to face: Journalists' findings from the classroom. Lie Directors.
Weisberg, L., Wan, X., Wusylko, C., & Kohnen, A. M. (2023). Critical Online Information Evaluation (COIE): A comprehensive model for curriculum and assessment design. Journal of Media Literacy Education, 15(1), 14-30. DOI: 10.23860/JMLE-2023-15-1-2
Wineburg, S., Breakstone, J., Smith, M., McGrew, S., & Ortega, T. (2019). Civic online reasoning: Curriculum evaluation (Working Paper 2019-A2). Stanford History Education Group, Stanford University. Available at https://stacks.stanford.edu/file/druid:xr124mv4805/COR%20Curriculum%20Evaluation.pdf (accessed 29.06.2023).


09. Assessment, Evaluation, Testing and Measurement
Paper

A Multilevel Meta-Analysis of the Validity of Student Rating Scales in Teaching Evaluation. Which Psychometric Characteristics Matter Most?

Daniel E. Iancu1,2, Marian D. Ilie1, Laurențiu P. Maricuțoiu1

1West University of Timisoara, Romania; 2University of Bucharest, Romania

Presenting Author: Iancu, Daniel E.

Student Teaching Evaluation (STE) is the procedure by which teaching performance is measured and assessed through questionnaires administered to students. Typically, these questionnaires or scales refer to the teaching practices of academic staff and are administered in one of the last meetings of the semester. From a practical standpoint, the primary purpose of implementing this procedure is universities' need to report STE results to quality assurance agencies. Another main objective of STE procedures, and certainly the most important from a pedagogical perspective, is to provide feedback to teachers about their teaching practices.

Previous studies on the highlighted topic present arguments both for and against the validity and utility of STE. On one hand, there are studies suggesting that STE results are influenced by other external variables, such as the teacher's gender or ethnicity (e.g., Boring, 2017), lenient grading (e.g., Griffin, 2004), or even the teacher's personality (e.g., Clayson & Sheffet, 2006).

On the other hand, there are published works showing that STE scales are valid and useful (e.g., Hammonds et al., 2017; Wright & Jenkins, 2012). Furthermore, when STE scales are rigorously developed and validated, as is the case with SEEQ (Marsh, 1982, 2009), there is a consistent level of agreement and evidence suggesting that STE scale scores are multidimensional, precise, valid, and relatively unaffected by other external variables (Marsh, 2007; Richardson, 2005; Spooren et al., 2013).

Even though this debate was very active in the 1970s and the evidence leaned in favor of STE validity (Richardson, 2005; Marsh, 2007), a recent meta-analysis (Uttl et al., 2017) presented evidence that seriously threatens the validity of STE results, suggesting that there is no relationship between STE results and student performance levels. The existence of this relationship is vital for the debate on STE validity, starting from the premise that if STE results accurately reflect good or efficient teaching, then teachers identified as more effective should facilitate higher performance among their students.

In light of all the above and referring to the results of the meta-analysis conducted by Uttl et al. (2017), the present study aims to investigate whether the relationship between STE results and student learning/performance is stronger when the STE scale used is more rigorously developed and validated. For this purpose, a multilevel meta-analysis was conducted, allowing us to consider multiple effect sizes for each study included in the analysis.

The results of this study can be useful in nuancing the picture of the validity of STE scales, in the sense that they can show us whether scales developed and validated in accordance with field standards can measure the quality of teaching more correctly and precisely. Additionally, this research can help outline a picture of which psychometric characteristics of STE scales contribute to a better measurement of teaching efficiency/effectiveness.

Therefore, the research questions guiding the present study are as follows:

  1. What is the average effect size of the relationship between STE results and student performance, in all STE studies with multiple sections published to date?
  2. Does the average effect of the relationship between STE results and student performance differ based on evidence regarding the validity of the STE scales used?
  3. Does the average effect of the relationship between STE results and student performance differ based on the content of the dimensions of the STE scales used?
  4. Does the average effect of the relationship between STE results and student performance differ based on the level of observability of the teaching behaviors in the items that make up the STE scales used?

Methodology, Methods, Research Instruments or Sources Used
The present study is a multilevel meta-analysis of the relationship between STE (Student Teaching Evaluation) results and student performance in multi-section STE studies, and of how different psychometric characteristics of the STE scales used in these studies (level and type of validity evidence, content of the dimensions, and level of observability/clarity of the items) moderate this relationship.

To be included in this meta-analysis, a study had to meet the following inclusion criteria:
1. Present correlational results between STE results and student performance.
2. Analyze the relationship between STE results and student performance in multiple sections of the same discipline (“multi-section STE studies”).
3. Students completed the same STE scale and the same performance assessment tests.
4. Student performance was measured through objective assessments focusing on actual learning, not students' perceptions of it.
5. The correlation between STE results and student performance was estimated using aggregate data at the section level, not at the individual student level.

The search for studies in the specialized literature was conducted through three procedures: 1) analysis of the reference lists of similar meta-analyses; 2) examination of all articles citing Uttl et al. (2017); 3) use of a search algorithm in the Academic Search Complete, Scopus, PsycINFO, and ERIC databases. After analyzing the abstracts and reading the full text of promising studies, 43 studies that met the inclusion criteria were identified and extracted.

For coding the level of validity evidence of the STE measures used, we adapted a framework of psychometric evaluation criteria proposed by Hunsley & Mash (2008). In adapting this framework, we also considered the recommendations put forth by Onwuegbuzie et al. (2009) and those of AERA, APA & NCME (2014). For coding the level of observability/clarity of the items that make up the STE scales used in the analyzed studies, we created a coding grid based on Murray (2007), which presents and explains the importance of using items with a high degree of measurability to reduce the subjectivity of the students responding to these items.

The data were analyzed in R (metafor package) using a multilevel meta-analysis, because most of the included studies report multiple effect sizes, usually one for each dimension of the STE scale. This type of analysis allows average effects to be estimated while preserving the original structure of the data reported in the primary studies.
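For orientation, the R sketch below shows the general shape of such a three-level model in metafor, with effect sizes nested within studies and a categorical moderator; the data frame and variable names are hypothetical and stand in for the study's actual dataset, which is not reproduced here.

library(metafor)

# Hypothetical data (for illustration): several correlations per study,
# one per STE dimension, with the number of course sections as sample size.
set.seed(42)
dat <- data.frame(
  study = rep(1:10, each = 3),
  es_id = 1:30,
  ri    = runif(30, 0, 0.5),          # correlation between STE score and performance
  ni    = sample(20:60, 30, replace = TRUE),
  observability = sample(c("low", "medium", "high"), 30, replace = TRUE)
)

# Convert correlations to Fisher's z and compute sampling variances.
dat <- escalc(measure = "ZCOR", ri = ri, ni = ni, data = dat)

# Three-level model: effect sizes (level 2) nested within studies (level 3).
overall <- rma.mv(yi, vi, random = ~ 1 | study / es_id, data = dat)
summary(overall)

# Moderator analysis, e.g. by the observability category of the items.
moderated <- rma.mv(yi, vi, mods = ~ factor(observability),
                    random = ~ 1 | study / es_id, data = dat)
summary(moderated)

# Back-transform the pooled Fisher's z to a correlation.
predict(overall, transf = transf.ztor)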

Conclusions, Expected Outcomes or Findings
The obtained results suggest that: 1) STE (Student Teaching Evaluation) scales with more validity evidence tend to measure teaching effectiveness better; 2) there is a set of dimensions that are more suitable than others for correctly measuring teaching effectiveness (for example, clarity of presentation, instructor enthusiasm, interaction with students, and availability for support had the strongest relationships with performance); and 3) the degree of observability of the items that make up the STE scales is a major factor regarding the ability of these scales to accurately measure teaching effectiveness.

Regarding the level of observability of the items contained in the STE scales, the items were divided into three categories (low/medium/high observability), and the relationship between STE results and student performance was compared across categories. As expected, the moderating effect is significant, meaning that there are significant differences between the correlations obtained within each category of studies. The strongest relationships exist for items with a high degree of observability, and as the degree of observability decreases, the intensity of the correlation between STE results and student performance also significantly decreases.

These results can help nuance the picture of the validity of STE scales, suggesting that STE scales developed and validated in accordance with the standards of the field can measure the quality of teaching more correctly and precisely. It can also be said that the proposed dimensionality and the level of observability of the items are of major importance in the development of any STE scale. These recommendations can be useful in any process of development or adaptation of an STE scale for use in the process of ensuring the quality of teaching in the university environment.

References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Boring, A. (2017). Gender biases in student evaluations of teaching. Journal of public economics, 145, 27-41.

Clayson, D. E., & Sheffet, M. J. (2006). Personality and the student evaluation of teaching. Journal of Marketing Education, 28, 149–160.

Griffin, B. W. (2004). Grading leniency, grade discrepancy, and student ratings of instruction. Contemporary Educational Psychology, 29, 410–425.

Hammonds, F., Mariano, G. J., Ammons, G., & Chambers, S. (2017). Student evaluations of teaching: improving teaching quality in higher education. Perspectives: Policy and Practice in Higher Education, 21(1), 26-33.

Hunsley, J., & Mash, E. J. (2008). Developing criteria for evidence-based assessment: An introduction to assessments that work. A guide to assessments that work, 2008, 3-14.

Marsh, H. W. (2007). Students’ evaluations of university teaching: Dimensionality, reliability, validity, potential biases and usefulness. In P.R., Pintrich & A. Zusho (Coord.), The scholarship of teaching and learning in higher education: An evidence-based perspective (pp. 319-383). Springer, Dordrecht.

McPherson, M. A., Todd Jewell, R., & Kim, M. (2009). What determines student evaluation scores? A random effects analysis of undergraduate economics classes. Eastern Economic Journal, 35, 37–51.

Onwuegbuzie, A. J., Daniel, L. G., & Collins, K. M. (2009). A meta-validation model for assessing the score-validity of student teaching evaluations. Quality & Quantity, 43(2), 197-209.

Richardson, J. T. (2005). Instruments for obtaining student feedback: A review of the literature. Assessment & evaluation in higher education, 30(4), 387-415.

Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83(4), 598-642.

Spooren, P., Vandermoere, F., Vanderstraeten, R., & Pepermans, K. (2017). Exploring high impact scholarship in research on student's evaluation of teaching (SET). Educational Research Review, 22, 129-141.

Uttl, B., White, C. A., & Gonzalez, D. W. (2017). Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies in Educational Evaluation, 54, 22-42.

Wright, S. L., & Jenkins-Guarnieri, M. A. (2012). Student evaluations of teaching: Combining the meta-analyses and demonstrating further evidence for effective use. Assessment & Evaluation in Higher Education, 37(6), 683-699.

