-
-
Author(s): He, Jia; Barrera-Pedemonte, Fabian; Buchholz, Janine
Title: Cross-cultural comparability of noncognitive constructs in TIMSS and PISA
In: Assessment in Education: Principles, Policy & Practice, 26 (2019) 4, pp. 369-385
DOI: 10.1080/0969594X.2018.1469467
URL: https://www.tandfonline.com/doi/full/10.1080/0969594X.2018.1469467
Publication Type: 3a. Contributions in peer-reviewed journals; contribution to a special issue
Language: English
Keywords: PISA <Programme for International Student Assessment>; TIMSS <Third International Mathematics and Science Study>; student achievement; achievement measurement; mathematics instruction; science instruction; enjoyment; motivation; school; identification <psychology>; lower secondary level; student; measurement method; comparison; item response theory; factor analysis; OECD countries
Abstract: Noncognitive assessments in the Programme for International Student Assessment (PISA) and the Trends in International Mathematics and Science Study (TIMSS) share certain similarities and provide complementary information, yet their comparability is seldom checked and convergence is seldom sought. We made use of student self-report data on Instrumental Motivation, Enjoyment of Science and Sense of Belonging to School targeted in both surveys in 29 overlapping countries to (1) demonstrate levels of measurement comparability, (2) check the convergence of different scaling methods within each survey and (3) check the convergence of these constructs with student achievement across surveys. We found that the three scales in either survey (except Sense of Belonging to School in PISA) reached at least metric invariance. The scale scores from the multigroup confirmatory factor analysis and the item response theory analysis were highly correlated, pointing to the robustness of the scaling methods. The correlations between each construct and achievement were generally positive within each culture in each survey, and the correlational pattern was similar across surveys (except for Sense of Belonging), indicating a certain convergence in the cross-survey validation. We stress the importance of checking measurement invariance before making comparative inferences, and we discuss implications for the quality and relevance of these constructs in understanding learning. (DIPF/Orig.)
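A minimal sketch of the convergence check in step (3), assuming hypothetical data frames and column names (country, enjoyment_science, science_score); the surveys' actual variable naming, scaling, and sampling weights are not reproduced here:

```python
# Correlate a construct with achievement within each country and survey,
# then compare the country-level correlation profiles across surveys.
import pandas as pd

def country_correlations(df: pd.DataFrame, construct: str, outcome: str) -> pd.Series:
    """Within-country Pearson correlations between a scale score and achievement."""
    return df.groupby("country").apply(lambda g: g[construct].corr(g[outcome]))

# pisa, timss: one row per student with country, scale score, achievement score
# r_pisa = country_correlations(pisa, "enjoyment_science", "science_score")
# r_timss = country_correlations(timss, "enjoyment_science", "science_score")
# Convergence would show up as similar correlational patterns across surveys:
# print(r_pisa.corr(r_timss))
```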
DIPF-Departments: Bildungsqualität und Evaluation
-
-
Author(s): Kroehne, Ulf; Buerger, Sarah; Hahnel, Carolin; Goldhammer, Frank
Title: Construct equivalence of PISA reading comprehension measured with paper‐based and computer‐based assessments
In: Educational Measurement: Issues and Practice, 38 (2019) 3, pp. 97-111
DOI: 10.1111/emip.12280
URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/emip.12280
Publication Type: 3a. Contributions in peer-reviewed journals; article (no special category)
Language: English
Keywords: influencing factor; student achievement; question; answer; interaction; difference; comparison; item response theory; Germany; PISA <Programme for International Student Assessment>; reading comprehension; measurement method; test construction; correlation; equivalence; paper-and-pencil test; computer-based method; technology-based testing; achievement measurement; test method; test administration
Abstract: For many years, reading comprehension in the Programme for International Student Assessment (PISA) was measured via paper‐based assessment (PBA). In the 2015 cycle, computer‐based assessment (CBA) was introduced, raising the question of whether central equivalence criteria required for a valid interpretation of the results are fulfilled. As an extension of the PISA 2012 main study in Germany, a random subsample of two intact PISA reading clusters, either computerized or paper‐based, was assessed using a random group design with an additional within‐subject variation. The results are in line with the hypothesis of construct equivalence. That is, the latent cross‐mode correlation of PISA reading comprehension was not significantly different from the expected correlation between the two clusters. Significant mode effects on item difficulties were observed for a small number of items only. Interindividual differences found in mode effects were negatively correlated with reading comprehension, but were not predicted by basic computer skills or gender. Further differences between modes were found with respect to the number of missing values.
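The construct-equivalence criterion above compares an observed latent cross-mode correlation with an expected benchmark correlation. As a generic illustration only (not the model-based comparison used in the study), a conventional way to test an observed correlation against a fixed expected value is the Fisher z-test:

```python
# Test H0: rho == r_expected via the Fisher z-transformation.
import numpy as np
from scipy import stats

def fisher_z_test(r_observed: float, r_expected: float, n: int) -> tuple[float, float]:
    z = (np.arctanh(r_observed) - np.arctanh(r_expected)) * np.sqrt(n - 3)
    p = 2 * stats.norm.sf(abs(z))  # two-sided p-value
    return z, p

# Example with invented numbers: observed 0.87 vs. expected 0.91, n = 500
z, p = fisher_z_test(0.87, 0.91, 500)
print(f"z = {z:.2f}, p = {p:.3f}")
```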
DIPF-Departments: Bildungsqualität und Evaluation
-
-
Author(s): Kroehne, Ulf; Hahnel, Carolin; Goldhammer, Frank
Title: Invariance of the response processes between gender and modes in an assessment of reading
In: Frontiers in Applied Mathematics and Statistics, 5 (2019), Article 2
DOI: 10.3389/fams.2019.00002
URL: https://www.frontiersin.org/articles/10.3389/fams.2019.00002/full
Publication Type: 3a. Contributions in peer-reviewed journals; contribution to a special issue
Language: English
Keywords: reading skill; technology-based testing; computer-based method; paper-and-pencil test; answer; time; measurement; item response theory; model; gender difference; log file; data analysis; empirical study; Germany
Abstract: In this paper, we developed a method to extract item-level response times from log data that are available in computer-based assessments (CBA) and in paper-based assessments (PBA) with digital pens. Based on response times that were extracted using only the time differences between responses, we used the bivariate generalized linear IRT model framework (B-GLIRT, [1]) to investigate response times as indicators of response processes. A parameterization that includes an interaction between the latent speed factor and the latent ability factor in the cross-relation function was found to fit the data best in both CBA and PBA. Data were collected with a within-subject design in a national add-on study to PISA 2012 administering two clusters of PISA 2009 reading units. After investigating the invariance of the measurement models for ability and speed between boys and girls, we found the expected gender effect in reading ability to coincide with a gender effect in speed in CBA. Taking this result as an indication of the validity of the time measures extracted from time differences between responses, we analyzed the PBA data and found the same gender effects for ability and speed. Analyzing the PBA and CBA data together, we identified the mode effect on ability as the latent difference between reading measured in CBA and in PBA. Similar to the gender effect, the mode effect on ability was observed together with a difference in latent speed between modes. However, while the relationship between speed and ability is identical for boys and girls, we found hints of mode differences in the estimated parameters of the cross-relation function used in the B-GLIRT model. (DIPF/Orig.)
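A minimal sketch of the extraction step described above, assuming a hypothetical log format (columns person, item, timestamp): item-level response times are computed purely as time differences between consecutive responses, so the first response per person has no defined response time here.

```python
import pandas as pd

log = pd.DataFrame({
    "person": [1, 1, 1, 2, 2],
    "item": ["i1", "i2", "i3", "i1", "i2"],
    "timestamp": pd.to_datetime([
        "2012-05-10 09:00:05", "2012-05-10 09:01:15", "2012-05-10 09:02:00",
        "2012-05-10 09:00:30", "2012-05-10 09:02:10",
    ]),
})

# Response time for an item = time elapsed since the previous response;
# the first response per person would be measured from an (unknown) test
# start, so it is left missing in this sketch.
log = log.sort_values(["person", "timestamp"])
log["response_time_s"] = log.groupby("person")["timestamp"].diff().dt.total_seconds()
print(log)
```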
DIPF-Departments: Bildungsqualität und Evaluation
-
-
Author(s): Naumann, Alexander; Musow, Stephanie; Aichele, Christine; Hochweber, Jan; Hartig, Johannes
Title: Instruktionssensitivität von Tests und Items [Instructional sensitivity of tests and items]
In: Zeitschrift für Erziehungswissenschaft, 22 (2019) 1, pp. 181-202
DOI: 10.1007/s11618-018-0832-0
URL: https://link.springer.com/article/10.1007%2Fs11618-018-0832-0
Publication Type: 3a. Contributions in peer-reviewed journals; article (no special category)
Language: German
Keywords: instruction; effectiveness; student achievement; achievement measurement; test; measurement method; empirical research; conception; validity; data; interpretation; psychometrics; item response theory; model
Abstract: Students' test results regularly serve as a central criterion for judging the effectiveness of schools and teaching. Valid inferences about schools and teaching presuppose that the tests used are able to capture possible effects of instruction, that is, that they are instructionally sensitive. However, this prerequisite is rarely examined empirically. As a result, it sometimes remains unclear whether a test was not instructionally sensitive or the teaching was not effective. Resolving this question requires empirically investigating the instructional sensitivity of the tests and items used.
While instructional sensitivity has long been discussed in the USA, the concept has so far received little attention in the German-language discourse. Our paper therefore aims to embed the concept of instructional sensitivity in the German-language discourse on educational achievement testing. To this end, three topics are addressed: (a) the theoretical background of the concept of instructional sensitivity, (b) the measurement of instructional sensitivity, and (c) the identification of further research needs. (DIPF/Orig.)
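To make the measurement idea concrete: one classical item-level index from this literature is the pretest-posttest difference index (PPDI), the change in an item's proportion correct from before to after instruction. The sketch below is illustrative with invented response vectors; the paper surveys a broader range of approaches, including IRT-based ones.

```python
import numpy as np

def ppdi(pre_responses: np.ndarray, post_responses: np.ndarray) -> float:
    """PPDI = p(correct, post) - p(correct, pre); values near 0 suggest
    the item is insensitive to instruction."""
    return post_responses.mean() - pre_responses.mean()

pre = np.array([0, 0, 1, 0, 1, 0, 0, 1])   # item scores before instruction
post = np.array([1, 1, 1, 0, 1, 1, 0, 1])  # same item after instruction
print(f"PPDI = {ppdi(pre, post):.2f}")
```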
DIPF-Departments: Bildungsqualität und Evaluation
-
-
Author(s): Pandarova, Irina; Schmidt, Torben; Hartig, Johannes; Boubekki, Ahcène; Jones, Roger Dale; Brefeld, Ulf
Title: Predicting the difficulty of exercise items for dynamic difficulty adaptation in adaptive language tutoring
In: International Journal of Artificial Intelligence in Education, 29 (2019) 3, pp. 342-367
DOI: 10.1007/s40593-019-00180-4
URL: https://link.springer.com/article/10.1007%2Fs40593-019-00180-4
Publication Type: 3a. Contributions in peer-reviewed journals; article (no special category)
Language: English
Keywords: foreign language instruction; English instruction; digital media; artificial intelligence; tutoring system; grammar; task; second language acquisition; problem solving; difficulty; prediction; measurement; computer-assisted learning; student; grade 9; grade 10; paper-and-pencil test; Gymnasium; integrated comprehensive school; item response theory; item analysis; Lower Saxony; Germany
Abstract: Advances in computer technology and artificial intelligence create opportunities for developing adaptive language learning technologies which are sensitive to individual learner characteristics. This paper focuses on one form of adaptivity in which the difficulty of learning content is dynamically adjusted to the learner's evolving language ability. A pilot study is presented which aims to advance the (semi-)automatic difficulty scoring of grammar exercise items to be used in dynamic difficulty adaptation in an intelligent language tutoring system for practicing English tenses. In it, methods from item response theory and machine learning are combined with linguistic item analysis in order to calibrate the difficulty of an initial exercise pool of cued gap-filling items (CGFIs) and to isolate CGFI features predictive of item difficulty. Multiple item features at the gap, context and CGFI levels are tested, and relevant predictors are identified at all three levels. Our pilot regression models reach encouraging levels of prediction accuracy which could, pending additional validation, enable the dynamic selection of newly generated items ranging from moderately easy to moderately difficult. The paper highlights further applications of the proposed methodology in the areas of adaptive language tutoring, item design and second language acquisition, and sketches out issues for future research. (DIPF/Orig.)
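A hedged sketch of the overall approach (calibrated IRT difficulties regressed on item features, evaluated by cross-validated prediction accuracy), using simulated data and invented feature names; the paper's actual feature set and models differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_items = 120
# Invented features, e.g., tense regularity, cue clarity, context length
X = rng.normal(size=(n_items, 3))
beta = np.array([0.8, -0.5, 0.3])
y = X @ beta + rng.normal(scale=0.4, size=n_items)  # calibrated IRT difficulties

# Cross-validated R^2 as a rough measure of difficulty-prediction accuracy
print(cross_val_score(LinearRegression(), X, y, scoring="r2", cv=5))
```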
DIPF-Departments: Bildungsqualität und Evaluation
-
-
Author(s): Rose, Norman; Nagy, Gabriel; Nagengast, Benjamin; Frey, Andreas; Becker, Michael
Title: Modeling multiple item context effects with generalized linear mixed models
In: Frontiers in Psychology, 10 (2019), Article 248
DOI: 10.3389/fpsyg.2019.00248
URL: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.00248/full
Publication Type: 3a. Contributions in peer-reviewed journals; article (no special category)
Language: English
Keywords: test; item; context; effect; model; data analysis; Germany
Abstract: Item context effects refer to the impact of features of a test on an examinee's item responses that cannot be explained by the abilities measured by the test. Investigations typically focus on only a single type of item context effect, such as item position effects or mode effects, thereby ignoring the fact that different item context effects might operate simultaneously. In this study, two different types of context effects were modeled simultaneously, drawing on data from an item calibration study of a multidimensional computerized test (N = 1,632) assessing student competencies in mathematics, science, and reading. We present a generalized linear mixed model (GLMM) parameterization of the multidimensional Rasch model that includes item position effects (distinguishing between within-block position effects and block position effects), domain order effects, and the interactions between them. Results show that both types of context effects played a role and that the moderating effect of domain order was very strong. The findings have direct consequences for planning and applying mixed-domain assessment designs. (DIPF/Orig.)
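To illustrate the parameterization, the sketch below fits a Rasch-type logistic model with an added linear item-position term to simulated data. It uses fixed effects as a simplification; the paper's GLMM uses random effects and additionally models domain order and interactions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_persons, n_items = 100, 20
theta = rng.normal(size=n_persons)            # person abilities
b = rng.normal(size=n_items)                  # item difficulties

rows = []
for p in range(n_persons):
    order = rng.permutation(n_items)          # randomized booklet order
    for pos, i in enumerate(order):
        lin = theta[p] - b[i] - 0.05 * pos    # true negative position effect
        rows.append((p, i, pos, rng.random() < 1 / (1 + np.exp(-lin))))
df = pd.DataFrame(rows, columns=["person", "item", "position", "correct"])

# Dummy-code persons and items; weak regularization approximates plain ML.
X = pd.get_dummies(df[["person", "item"]].astype(str)).assign(position=df["position"])
fit = LogisticRegression(C=1e6, max_iter=5000).fit(X, df["correct"].astype(int))
print(X.columns[-1], fit.coef_[0][-1])        # estimate of the -0.05 position effect
```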
DIPF-Departments: Struktur und Steuerung des Bildungswesens
-
-
Author(s): Schmidt, Laura I.; Scheiter, Fabian; Neubauer, Andreas B.; Sieverding, Monika
Title: Anforderungen, Entscheidungsfreiräume und Stress im Studium. Erste Befunde zu Reliabilität und Validität eines Fragebogens zu strukturellen Belastungen und Ressourcen (StrukStud) in Anlehnung an den Job Content Questionnaire [Demands, decision latitude, and stress among university students: First findings on the reliability and validity of a questionnaire on structural demands and resources (StrukStud), based on the Job Content Questionnaire]
In: Diagnostica, 65 (2019) 2, pp. 63-74
DOI: 10.1026/0012-1924/a000213
URN: urn:nbn:de:0111-pedocs-180602
URL: http://nbn-resolving.org/urn:nbn:de:0111-pedocs-180602
Publication Type: 3a. Contributions in peer-reviewed journals; article (no special category)
Language: German
Keywords: higher education studies; university; stress; strain; well-being; health; decision; freedom; support; student; self-assessment; model; questionnaire; psychometrics; validity; reliability; factor analysis; item analysis; empirical study; Heidelberg; Germany
Abstract: With the demand-control model and the corresponding Job Content Questionnaire (JCQ), a well-established model for predicting physical and mental health risks exists for the world of work. To allow theory-driven predictions of such risks among university students as well, we adapted the JCQ to the higher education context and used the resulting questionnaire on structural demands and resources in higher education (StrukStud) to examine its contribution to explaining perceived stress and well-being. In 4 studies with a total of 732 students (of psychology, teacher education, social work, business law, and educational science), we assessed the demand-control dimensions (StrukStud), perceived stress (Heidelberg Stress Index [HEI-STRESS] and Perceived Stress Questionnaire), and further reference constructs such as study satisfaction and physical complaints. Findings on reliability and validity are presented. The results demonstrate the psychometric quality of the StrukStud and its potential to explain stress in higher education. The StrukStud is the first economical self-report instrument for the German-speaking countries that measures psychological demands and decision latitude in higher education.
DIPF-Departments: Bildung und Entwicklung
-
-
Author(s): Wu, Qian; Debeer, Dries; Buchholz, Janine; Hartig, Johannes; Janssen, Rianne
Title: Predictors of individual performance changes related to item positions in PISA assessments
In: Large-scale Assessments in Education, 7 (2019), Article 5
DOI: 10.1186/s40536-019-0073-6
URL: https://largescaleassessmentsineducation.springeropen.com/articles/10.1186/s40536-019-0073-6
Publication Type: 3a. Contributions in peer-reviewed journals; article (no special category)
Language: English
Keywords: achievement test; test item; design; effect; PISA <Programme for International Student Assessment>; science competence; reading competence; mathematical competence; student achievement; questionnaire; multilevel analysis; item response theory
Abstract: Background:
Item position effects have been a common concern in large-scale assessments, as changing the order of items across booklets may have an undesired effect on test performance. If every test taker were affected in the very same way, comparisons between groups of individuals would still be valid. However, research has shown that, in addition to a general fixed effect of item positions, the extent of the effect varies considerably across individuals. These individual differences are referred to as persistence. Test takers with a high level of persistence are able to keep up their performance better throughout the test administration, whereas those with a lower level of persistence show a larger decline in their test performance.
Methods:
The present study applied a multilevel extended item response theory (IRT) framework and used data from the PISA 2006 science, 2009 reading, and 2012 mathematics assessments. The first objective of this study is to provide a systematic investigation of item position effects across the three PISA domains, partially replicating previous studies on PISA 2006 and 2009. Second, this study aims to gain a better understanding of the nature of individual differences in position effects by relating them to student characteristics. Gender, socio-economic status, language spoken at home, and three motivational scales (enjoyment of the assessed subject, effort thermometer, perseverance) were used as person covariates for persistence.
Results:
This study replicated and extended the results of previous studies. An overall negative item-cluster position effect and significant individual differences in this effect were found in all countries and in all three PISA domains. Furthermore, the person covariate most consistently related to persistence was gender, with girls keeping up their performance better than boys. Other predictors showed little or inconsistent effects on persistence.
Conclusions:
Our study demonstrated inter-individual differences as well as group differences in item position effects, which may threaten the comparability between persons and groups. The consequences and implications of item position effects and persistence for the interpretation of PISA results are discussed.
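An illustrative simulation of the key concept, with invented parameter values: a fixed negative position effect gamma plus a person-specific deviation u_p (persistence), so that high-persistence test takers decline less across item positions.

```python
import numpy as np

rng = np.random.default_rng(7)
n_persons, n_items = 1000, 30
theta = rng.normal(size=n_persons)             # ability
gamma = -0.02                                  # average (fixed) position effect
u = rng.normal(scale=0.01, size=n_persons)     # persistence: person-specific slope
b = rng.normal(size=n_items)                   # item difficulties

pos = np.arange(n_items)
# Success probability for person p on the item at position k:
# P = logistic(theta_p - b_k + (gamma + u_p) * k)
logits = theta[:, None] - b[None, :] + (gamma + u[:, None]) * pos[None, :]
p_correct = 1 / (1 + np.exp(-logits))

# Persons with high persistence (u_p > 0) decline less across the test.
print(p_correct[u.argmax()].round(2))  # flattest profile
print(p_correct[u.argmin()].round(2))  # steepest decline
```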
DIPF-Departments: Bildungsqualität und Evaluation
-
-
Author(s): Zehner, Fabian; Weis, Mirjam; Vogel, Freydis; Leutner, Detlev; Reiss, Kristina
Title: Kollaboratives Problemlösen in PISA 2015. Deutschland im Fokus [Collaborative problem solving in PISA 2015: Germany in focus]
In: Zeitschrift für Erziehungswissenschaft, 22 (2019) 3, pp. 617-646
DOI: 10.1007/s11618-019-00874-4
URN: urn:nbn:de:0111-pedocs-176046
URL: http://nbn-resolving.org/urn:nbn:de:0111-pedocs-176046
Publication Type: 3a. Contributions in peer-reviewed journals; article (no special category)
Language: German
Keywords: student achievement test; questionnaire; PISA <Programme for International Student Assessment>; international comparison; Germany; OECD countries; student; problem solving; cooperation; competence; school year; school type; computer-based method; simulation; technology-based testing; measurement method; quality; psychometrics; item response theory; scaling
Abstract: This article focuses on the results for Germany in the international comparison of fifteen-year-olds' collaborative problem-solving competence in the Programme for International Student Assessment (PISA) 2015 and reports the results of a cross-validation of the scaling. A new computer-based test was used in which students solve problems together with simulated group members. Data on collaborative problem-solving competence were collected from n = 124,994 fifteen-year-olds in 51 countries. Students in Germany show above-average competence (525 points), a quarter of a standard deviation below the top-performing OECD country, Japan (552 points), and a quarter of a standard deviation above the OECD average (500 points). In all countries, girls score higher than boys. While the proportion of highly proficient students in Germany is comparable to that in the top-performing countries, 21% reach only proficiency level I or below, twice as many as in Japan. The article also presents national results, provides empirical evidence on the quality of the test, and discusses it critically. (DIPF/Orig.)
DIPF-Departments: Bildungsqualität und Evaluation
-
-
Author(s): Bengs, Daniel; Brefeld, Ulf; Kröhne, Ulf
Title: Adaptive item selection under matroid constraints
In: Journal of Computerized Adaptive Testing, 6 (2018) 2, pp. 15-36
DOI: 10.7333/1808-0602015
URN: urn:nbn:de:0111-dipfdocs-166953
URL: http://www.dipfdocs.de/volltexte/2020/16695/pdf/JCAT_2018_2_Bengs_Brefeld_Kroehne_Adaptive_item_selection_under_matroid_constraints_A.pdf
Publication Type: 3a. Contributions in peer-reviewed journals; article (no special category)
Language: English
Keywords: adaptive testing; algorithm; computer-based method; item bank; measurement method; technology-based testing; test construction
Abstract: The shadow testing approach (STA; van der Linden & Reese, 1998) is considered the state of the art in constrained item selection for computerized adaptive tests. The present paper shows that certain types of constraints (e.g., bounds on categorical item attributes) induce a matroid on the item bank. This observation is used to devise item selection algorithms that are based on matroid optimization and lead to optimal tests, as the STA does. In particular, a single matroid constraint can be treated optimally by an efficient greedy algorithm that selects the most informative item preserving the integrity of the constraints. A simulation study shows that for applicable constraints, the optimal algorithms realize a decrease in standard error (SE) corresponding to a reduction in test length of up to 10% compared to the maximum priority index (Cheng & Chang, 2009) and up to 30% compared to Kingsbury and Zara's (1991) constrained computerized adaptive testing.
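A hedged sketch of the greedy idea for a single partition matroid constraint (upper bounds on a categorical item attribute): repeatedly select the most informative eligible item at the current ability estimate. Item parameters, content areas, and bounds are invented; this is not the paper's implementation.

```python
import numpy as np

def fisher_info_2pl(a: float, b: float, theta: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1 / (1 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

items = [  # (item id, discrimination a, difficulty b, content area)
    ("i1", 1.2, 0.0, "algebra"), ("i2", 0.8, -0.5, "algebra"),
    ("i3", 1.5, 0.3, "geometry"), ("i4", 1.0, 0.8, "geometry"),
    ("i5", 0.9, -0.2, "numbers"),
]
bounds = {"algebra": 1, "geometry": 2, "numbers": 1}  # partition matroid
theta, test_length = 0.2, 3

selected, used = [], {area: 0 for area in bounds}
while len(selected) < test_length:
    eligible = [it for it in items
                if it not in selected and used[it[3]] < bounds[it[3]]]
    if not eligible:
        break
    # Greedy step: most informative item whose content-area bound is not exhausted
    best = max(eligible, key=lambda it: fisher_info_2pl(it[1], it[2], theta))
    selected.append(best)
    used[best[3]] += 1

print([it[0] for it in selected])
```

For a modular objective such as summed Fisher information, this greedy rule is optimal under a single matroid constraint, which is the property the abstract exploits.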
DIPF-Departments: Bildungsqualität und Evaluation