Search results from the DIPF publication database
Your query:
(Schlagwörter: "Item-Response-Theory")
64 results found
Cross-cultural comparability of noncognitive constructs in TIMSS and PISA
Authors:
He, Jia; Barrera-Pedemonte, Fabian; Buchholz, Janine
Title:
Cross-cultural comparability of noncognitive constructs in TIMSS and PISA
In:
Assessment in Education, 26 (2019) 4, pp. 369-385
DOI:
10.1080/0969594X.2018.1469467
URL:
https://www.tandfonline.com/doi/full/10.1080/0969594X.2018.1469467
Document type:
3a. Articles in peer-reviewed journals; contribution to a special issue
Language:
English
Keywords:
PISA <Programme for International Student Assessment>; TIMSS <Third International Mathematics and Science Study>; Schülerleistung; Leistungsmessung; Mathematikunterricht; Naturwissenschaftlicher Unterricht; Freude; Motivation; Schule; Identifikation <Psy>; Sekundarstufe I; Schüler; Messverfahren; Vergleich; Item-Response-Theory; Faktorenanalyse; OECD-Länder
Abstract:
Noncognitive assessments in the Programme for International Student Assessment (PISA) and the Trends in International Mathematics and Science Study (TIMSS) share certain similarities and provide complementary information, yet their comparability is seldom checked and convergence is rarely sought. We made use of student self-report data on Instrumental Motivation, Enjoyment of Science and Sense of Belonging to School targeted in both surveys in 29 overlapping countries to (1) demonstrate levels of measurement comparability, (2) check convergence of different scaling methods within each survey and (3) check convergence of these constructs with student achievement across surveys. We found that the three scales in either survey (except Sense of Belonging to School in PISA) reached at least metric invariance. The scale scores from the multigroup confirmatory factor analysis and the item response theory analysis were highly correlated, pointing to the robustness of the scaling methods. The correlations between each construct and achievement were generally positive within each culture in each survey, and the correlational pattern was similar across surveys (except for Sense of Belonging), indicating a certain convergence in the cross-survey validation. We stress the importance of checking measurement invariance before making comparative inferences, and we discuss implications for the quality and relevance of these constructs in understanding learning. (DIPF/Orig.)
DIPF department:
Bildungsqualität und Evaluation
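For readers less familiar with the invariance levels referred to in the abstract above, a minimal multigroup measurement model makes them concrete (generic notation, not taken from the article): for item j, student i and country g,

    x_{ij}^{(g)} = \nu_j^{(g)} + \lambda_j^{(g)} \eta_i^{(g)} + \varepsilon_{ij}^{(g)} .

Configural invariance only requires the same pattern of loadings in every country; metric invariance additionally constrains \lambda_j^{(g)} = \lambda_j for all g, which is what licenses comparing correlations (e.g., construct-achievement correlations) across countries; scalar invariance further constrains the intercepts, \nu_j^{(g)} = \nu_j, and would be needed before comparing latent means.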
Construct equivalence of PISA reading comprehension measured with paper‐based and computer‐based assessments
Authors:
Kroehne, Ulf; Buerger, Sarah; Hahnel, Carolin; Goldhammer, Frank
Title:
Construct equivalence of PISA reading comprehension measured with paper‐based and computer‐based assessments
In:
Educational Measurement, 38 (2019) 3, pp. 97-111
DOI:
10.1111/emip.12280
URL:
https://onlinelibrary.wiley.com/doi/abs/10.1111/emip.12280
Document type:
3a. Articles in peer-reviewed journals; article (no special category)
Language:
English
Keywords:
Einflussfaktor; Schülerleistung; Frage; Antwort; Interaktion; Unterschied; Vergleich; Item-Response-Theory; Deutschland; PISA <Programme for International Student Assessment>; Leseverstehen; Messverfahren; Testkonstruktion; Korrelation; Äquivalenz; Papier-Bleistift-Test; Computerunterstütztes Verfahren; Technologiebasiertes Testen; Leistungsmessung; Testverfahren; Testdurchführung
Abstract:
For many years, reading comprehension in the Programme for International Student Assessment (PISA) was measured via paper‐based assessment (PBA). In the 2015 cycle, computer‐based assessment (CBA) was introduced, raising the question of whether central equivalence criteria required for a valid interpretation of the results are fulfilled. As an extension of the PISA 2012 main study in Germany, a random subsample of two intact PISA reading clusters, either computerized or paper‐based, was assessed using a random group design with an additional within‐subject variation. The results are in line with the hypothesis of construct equivalence. That is, the latent cross‐mode correlation of PISA reading comprehension was not significantly different from the expected correlation between the two clusters. Significant mode effects on item difficulties were observed for a small number of items only. Interindividual differences found in mode effects were negatively correlated with reading comprehension, but were not predicted by basic computer skills or gender. Further differences between modes were found with respect to the number of missing values.
DIPF department:
Bildungsqualität und Evaluation
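The construct-equivalence criterion referred to in the abstract above can be stated compactly (illustrative notation, not the authors'): with \theta_A and \theta_B denoting the reading abilities measured by the two intact clusters, construct equivalence across modes implies

    \rho(\theta_A^{CBA}, \theta_B^{PBA}) \approx \rho(\theta_A, \theta_B) ,

i.e., the latent correlation between the computer-based and the paper-based cluster should not fall below the correlation expected between the two clusters when both are administered in the same mode; a markedly lower cross-mode correlation would indicate that the two modes partly measure different constructs.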
Invariance of the response processes between gender and modes in an assessment of reading
Authors:
Kroehne, Ulf; Hahnel, Carolin; Goldhammer, Frank
Title:
Invariance of the response processes between gender and modes in an assessment of reading
In:
Frontiers in Applied Mathematics and Statistics, (2019), 5:2
DOI:
10.3389/fams.2019.00002
URL:
https://www.frontiersin.org/articles/10.3389/fams.2019.00002/full
Document type:
3a. Articles in peer-reviewed journals; contribution to a special issue
Language:
English
Keywords:
Lesefertigkeit; Technologiebasiertes Testen; Computerunterstütztes Verfahren; Papier-Bleistift-Test; Antwort; Zeit; Messung; Item-Response-Theory; Modell; Geschlechtsspezifischer Unterschied; Logdatei; Datenanalyse; Empirische Untersuchung; Deutschland
Abstract:
In this paper, we developed a method to extract item-level response times from log data that are available in computer-based assessments (CBA) and paper-based assessments (PBA) with digital pens. Based on response times that were extracted using only time differences between responses, we used the bivariate generalized linear IRT model framework (B-GLIRT, [1]) to investigate response times as indicators of response processes. A parameterization that includes an interaction between the latent speed factor and the latent ability factor in the cross-relation function was found to fit the data best in CBA and PBA. Data were collected with a within-subject design in a national add-on study to PISA 2012, administering two clusters of PISA 2009 reading units. After investigating the invariance of the measurement models for ability and speed between boys and girls, we found the expected gender effect in reading ability to coincide with a gender effect in speed in CBA. Taking this result as an indication of the validity of the time measures extracted from time differences between responses, we analyzed the PBA data and found the same gender effects for ability and speed. Analyzing PBA and CBA data together, we identified the ability mode effect as the latent difference between reading measured in CBA and PBA. Similar to the gender effect, the mode effect in ability was observed together with a difference in the latent speed between modes. However, while the relationship between speed and ability is identical for boys and girls, we found hints of mode differences in the estimated parameters of the cross-relation function used in the B-GLIRT model. (DIPF/Orig.)
DIPF department:
Bildungsqualität und Evaluation
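The first step described in the abstract above, recovering item-level response times purely from the time differences between consecutive responses in the log data, can be sketched in a few lines of Python. The event format (person, item, response timestamp) and the function name are illustrative assumptions, not the actual log schema of the assessment.

from collections import defaultdict

def response_times_from_log(events):
    """Derive item-level response times as differences between consecutive
    response timestamps of the same person (illustrative sketch only).

    events: iterable of (person_id, item_id, timestamp_in_seconds),
            one entry per given response.
    Returns {person_id: {item_id: response_time_in_seconds}}.
    The first response of each person has no predecessor and is skipped here;
    in practice it would be anchored to the unit start event in the log.
    """
    by_person = defaultdict(list)
    for person, item, ts in events:
        by_person[person].append((ts, item))

    rts = defaultdict(dict)
    for person, entries in by_person.items():
        entries.sort()                                  # order responses in time
        for (prev_ts, _), (ts, item) in zip(entries, entries[1:]):
            rts[person][item] = ts - prev_ts
    return dict(rts)

# Two responses logged 42 s apart yield a response time of 42 s for the second item.
log = [("p1", "R01", 10.0), ("p1", "R02", 52.0)]
print(response_times_from_log(log))                    # {'p1': {'R02': 42.0}}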
Instruktionssensitivität von Tests und Items
Authors:
Naumann, Alexander; Musow, Stephanie; Aichele, Christine; Hochweber, Jan; Hartig, Johannes
Title:
Instruktionssensitivität von Tests und Items
In:
Zeitschrift für Erziehungswissenschaft, 22 (2019) 1, pp. 181-202
DOI:
10.1007/s11618-018-0832-0
URL:
https://link.springer.com/article/10.1007%2Fs11618-018-0832-0
Document type:
3a. Articles in peer-reviewed journals; article (no special category)
Language:
German
Keywords:
Unterricht; Effektivität; Schülerleistung; Leistungsmessung; Test; Messverfahren; Empirische Forschung; Konzeption; Validität; Daten; Interpretation; Psychometrie; Item-Response-Theory; Modell
Abstract:
Students' performance in assessments is regularly attributed to more or less effective teaching. Valid interpretation requires that outcomes are affected by instruction to a significant degree. Hence, instruments need to be capable of detecting effects of instruction, that is, instruments need to be instructionally sensitive. However, empirical investigation of the instructional sensitivity of tests and items is seldom carried out in practice. In consequence, in many cases it remains unclear whether teaching was ineffective or the instrument was insensitive. While there is a lively discussion on the instructional sensitivity of tests and items in the USA, the concept of instructional sensitivity is rather unknown in German-speaking countries. Thus, the present study aims at (a) introducing the concept of instructional sensitivity, (b) providing an overview of current approaches to measuring instructional sensitivity, and (c) identifying further research directions. (DIPF/Orig.)
DIPF department:
Bildungsqualität und Evaluation
Predicting the difficulty of exercise items for dynamic difficulty adaptation in adaptive language tutoring
Authors:
Pandarova, Irina; Schmidt, Torben; Hartig, Johannes; Boubekki, Ahcène; Jones, Roger Dale; Brefeld, Ulf
Title:
Predicting the difficulty of exercise items for dynamic difficulty adaptation in adaptive language tutoring
In:
International Journal of Artificial Intelligence in Education, 29 (2019) 3, pp. 342-367
DOI:
10.1007/s40593-019-00180-4
URL:
https://link.springer.com/article/10.1007%2Fs40593-019-00180-4
Document type:
3a. Articles in peer-reviewed journals; article (no special category)
Language:
English
Keywords:
Fremdsprachenunterricht; Englischunterricht; Digitale Medien; Künstliche Intelligenz; Tutorensystem; Grammatik; Aufgabe; Zweitsprachenerwerb; Problemlösen; Schwierigkeit; Prognose; Messung; Computerunterstütztes Lernen; Schüler; Schuljahr 09; Schuljahr 10; Papier-Bleistift-Test; Gymnasium; Integrierte Gesamtschule; Item-Response-Theory; Itemanalyse; Niedersachsen; Deutschland
Abstract:
Advances in computer technology and artificial intelligence create opportunities for developing adaptive language learning technologies which are sensitive to individual learner characteristics. This paper focuses on one form of adaptivity in which the difficulty of learning content is dynamically adjusted to the learner's evolving language ability. A pilot study is presented which aims to advance the (semi-)automatic difficulty scoring of grammar exercise items to be used in dynamic difficulty adaptation in an intelligent language tutoring system for practicing English tenses. In it, methods from item response theory and machine learning are combined with linguistic item analysis in order to calibrate the difficulty of an initial exercise pool of cued gap-filling items (CGFIs) and isolate CGFI features predictive of item difficulty. Multiple item features at the gap, context and CGFI levels are tested and relevant predictors are identified at all three levels. Our pilot regression models reach encouraging prediction accuracy levels which could, pending additional validation, enable the dynamic selection of newly generated items ranging from moderately easy to moderately difficult. The paper highlights further applications of the proposed methodology in the areas of adaptive language tutoring, item design and second language acquisition, and sketches out issues for future research. (DIPF/Orig.)
DIPF department:
Bildungsqualität und Evaluation
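The difficulty-prediction step summarised in the abstract above, regressing IRT-calibrated item difficulties on item features and then scoring new items, can be illustrated with a plain least-squares sketch in Python; the feature coding and all numbers below are invented for illustration and are not taken from the study.

import numpy as np

# Invented item features (intercept, number of cues in the context, irregular verb yes/no)
# and IRT-calibrated difficulties in logits; all values are made up for illustration.
X = np.array([
    [1.0, 2.0, 0.0],
    [1.0, 1.0, 1.0],
    [1.0, 3.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 2.0, 1.0],
])
b = np.array([-0.4, 0.8, -1.1, -0.2, 0.5])

# Ordinary least squares: difficulty ~ item features
coef, *_ = np.linalg.lstsq(X, b, rcond=None)

# Predicted difficulty of a new, not-yet-calibrated item with the same feature coding
new_item = np.array([1.0, 2.0, 1.0])
print("coefficients:", coef)
print("predicted difficulty:", new_item @ coef)

In the study itself, the difficulties come from an IRT calibration of the pilot item pool and the feature set covers the gap, context and item levels; the same prediction step would then place newly generated items on the calibrated difficulty scale.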
Predictors of individual performance changes related to item positions in PISA assessments
Authors:
Wu, Qian; Debeer, Dries; Buchholz, Janine; Hartig, Johannes; Janssen, Rianne
Title:
Predictors of individual performance changes related to item positions in PISA assessments
In:
Large-scale Assessments in Education, (2019), 7:5
DOI:
10.1186/s40536-019-0073-6
URL:
https://largescaleassessmentsineducation.springeropen.com/articles/10.1186/s40536-019-0073-6
Document type:
3a. Articles in peer-reviewed journals; article (no special category)
Language:
English
Keywords:
Leistungstest; Testaufgabe; Design; Wirkung; PISA <Programme for International Student Assessment>; Naturwissenschaftliche Kompetenz; Lesekompetenz; Mathematische Kompetenz; Schülerleistung; Fragebogen; Mehrebenenanalyse; Item-Response-Theory
Abstract:
Background: Item position effects have been a common concern in large-scale assessments, as changing the order of items in booklets may have an undesired effect on test performance. If every test taker were affected by the effect in the very same way, comparisons between groups of individuals would still be valid. However, research has shown that in addition to a general fixed effect of item positions, the extent of the effect varies considerably across individuals. These individual differences are referred to as persistence. Test takers with a high level of persistence are able to keep up their performance better throughout the test administration, whereas those with a lower level of persistence show a larger decline in their test performance. Methods: The present study applied a multilevel extended item response theory (IRT) framework and used the data from the PISA 2006 science, 2009 reading, and 2012 mathematics assessments. The first objective of this study is to provide a systematic investigation of item position effects across the three PISA domains, partially replicating the previous studies on PISA 2006 and 2009. Second, this study aims to gain a better understanding of the nature of individual differences in position effects by relating them to student characteristics. Gender, socio-economic status, language spoken at home, and three motivational scales (enjoyment of doing the subject being assessed, effort thermometer, perseverance) were used as person covariates for persistence. Results: This study replicated and extended the results found in previous studies. An overall negative item cluster position effect and significant individual differences in this effect were found in all the countries in the three PISA domains. Furthermore, the most frequently observed effect of person covariates on persistence was that of gender, with girls keeping up their performance better than boys. Other predictors showed little or inconsistent effects on persistence. Conclusions: Our study demonstrated inter-individual differences as well as group differences in item position effects, which may threaten the comparability between persons and groups. The consequences and implications of item position effects and persistence for the interpretation of PISA results are discussed.
DIPF department:
Bildungsqualität und Evaluation
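A stylised version of the kind of model used in the abstract above, an IRT model extended with a fixed position effect and a person-specific deviation from it, is (the article's exact multilevel parameterisation may differ):

    \mathrm{logit}\, P(X_{pi} = 1) = \theta_p - \beta_i + (\gamma + \delta_p)\,\mathrm{pos}_{pi} ,

where \mathrm{pos}_{pi} is the (cluster) position at which person p received item i, \gamma is the average position effect (negative when performance declines towards the end of the test), and \delta_p is the person-specific deviation, i.e., the persistence whose variance and person covariates (gender, socio-economic status, motivation) the study examines.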
Kollaboratives Problemlösen in PISA 2015. Deutschland im Fokus
Authors:
Zehner, Fabian; Weis, Mirjam; Vogel, Freydis; Leutner, Detlev; Reiss, Kristina
Title:
Kollaboratives Problemlösen in PISA 2015. Deutschland im Fokus
In:
Zeitschrift für Erziehungswissenschaft, 22 (2019) 3, pp. 617-646
DOI:
10.1007/s11618-019-00874-4
URN:
urn:nbn:de:0111-pedocs-176046
URL:
http://nbn-resolving.org/urn:nbn:de:0111-pedocs-176046
Document type:
3a. Articles in peer-reviewed journals; article (no special category)
Language:
German
Keywords:
Schülerleistungstest; Fragebogen; PISA <Programme for International Student Assessment>; Internationaler Vergleich; Deutschland; OECD-Länder; Schüler; Problemlösen; Kooperation; Kompetenz; Schuljahr; Schulform; Computerunterstütztes Verfahren; Simulation; Technologiebasiertes Testen; Messverfahren; Qualität; Psychometrie; Item-Response-Theory; Skalierung
Abstract:
Focusing on Germany, this article presents results from the international comparison of fifteen-year-olds in collaborative problem solving and a cross-validation of the scaling in the Programme for International Student Assessment (PISA) 2015. A new computer-based test was used, requiring students to solve problems jointly with simulated group members. Data on the collaborative problem-solving competence of fifteen-year-olds (n = 124,994) in 51 countries were assessed. The German mean competence level (525 points) is a quarter of a standard deviation above the OECD average (500 points) and a quarter of a standard deviation below the OECD's top-performing country, Japan (552 points). In all participating countries, girls outperform boys. While the percentage of top-performing students in Germany is comparable to the proportions in the best-performing OECD countries, 21% of the students in Germany reach only competence level I or below, twice as many as in Japan. National results are presented, as well as empirical evidence on the quality of the test, which is critically discussed. (DIPF/Orig.)
DIPF department:
Bildungsqualität und Evaluation
The impact of ignoring the partially compensatory relation between ability dimensions on norm-referenced test scores
Authors:
Buchholz, Janine; Hartig, Johannes
Title:
The impact of ignoring the partially compensatory relation between ability dimensions on norm-referenced test scores
In:
Psychological Test and Assessment Modeling, 60 (2018) 3, pp. 369-385
URL:
https://www.psychologie-aktuell.com/fileadmin/Redaktion/Journale/ptam_3-2018_369-385.pdf
Document type:
3a. Articles in peer-reviewed journals; contribution to a special issue
Language:
English
Keywords:
Schülerleistung; Leistungsmessung; Test; Interpretation; Item-Response-Theory; Modell; Methode; Validität; Mathematische Kompetenz; Sprachfertigkeit; Simulation; Empirische Untersuchung
Abstract:
The IRT models most commonly employed to estimate within-item multidimensionality are compensatory and suggest that some dimensions (e.g., traits or abilities) can make up for a lack in others. However, many assessment frameworks in educational large-scale assessments suggest partially compensatory relations among dimensions. In two Monte Carlo simulation studies we varied the loading pattern, the latent correlation between dimensions and the ability distribution to evaluate the impact on test scores when a compensatory model is incorrectly applied to partially compensatory data. Findings imply only negligible effects when true abilities are bivariate normal. Assuming a uniform distribution, however, analyses of differences in test scores demonstrated systematic effects for specific patterns of true ability: high abilities are largely underestimated when the other ability required to solve some of the items is low. These findings highlight the necessity of applying the partially compensatory model under data conditions likely to occur in educational large-scale assessments. (DIPF/Orig.)
DIPF department:
Bildungsqualität und Evaluation
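The two model classes contrasted in the abstract above can be written down for an item that loads on two dimensions (standard textbook forms with \sigma the logistic function; the article's exact parameterisation may differ). The compensatory model,

    P(X_{pi} = 1) = \sigma( a_{1i}\theta_{1p} + a_{2i}\theta_{2p} - b_i ) ,

lets a high \theta_1 offset a low \theta_2, whereas the partially compensatory (multiplicative) model,

    P(X_{pi} = 1) = \prod_{k=1}^{2} \sigma( a_{ki}(\theta_{kp} - b_{ki}) ) ,

keeps the success probability low whenever either ability is low, which is why fitting the compensatory form to such data can substantially underestimate a high ability that is paired with a low one, as the simulation results described above indicate.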
Response time-based treatment of omitted responses in computer-based testing
Authors:
Frey, Andreas; Spoden, Christian; Goldhammer, Frank; Wenzel, S. Franziska C.
Title:
Response time-based treatment of omitted responses in computer-based testing
In:
Behaviormetrika, 45 (2018) 2, pp. 505-526
DOI:
10.1007/s41237-018-0073-9
Document type:
3a. Articles in peer-reviewed journals; contribution to a special issue
Language:
English
Keywords:
Methode; Technologiebasiertes Testen; Antwort; Dauer; Verhalten; Item-Response-Theory; Fehlende Daten; Datenanalyse; Testaufgabe; Typologie; Medienkompetenz; Schülerleistungstest; Testauswertung
Abstract:
A new response time-based method for coding omitted item responses in computer-based testing is introduced and illustrated with empirical data. The new method is derived from the theory of missing data problems of Rubin and colleagues and embedded in an item response theory framework. Its basic idea is to use item response times to test statistically, for each individual item, whether omitted responses are missing completely at random (MCAR) or missing due to a lack of ability and, thus, not at random (MNAR), with fixed type-1 and type-2 error levels. If the MCAR hypothesis is maintained, omitted responses are coded as not administered (NA), and as incorrect (0) otherwise. The empirical illustration draws on the responses given by N = 766 students to 70 items of a computer-based ICT skills test. The new method is compared with the two common deterministic methods of scoring omitted responses as 0 or as NA. As a result, response time thresholds from 18 to 58 s were identified. More omitted responses were recoded into 0 (61%) than into NA (39%). The differences in item difficulty were larger when the new method was compared with deterministic scoring of omitted responses as NA than when it was compared with scoring them as 0. The variances and reliabilities obtained under the three methods showed small differences. The paper concludes with a discussion of the practical relevance of the observed effect sizes and with recommendations for applying the new method in the early stage of data processing. (DIPF/Orig.)
DIPF department:
Bildungsqualität und Evaluation
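Once an item-specific response-time threshold has been determined, the recoding step described in the abstract above reduces to a simple rule. The Python sketch below assumes, purely for illustration and not as the authors' exact decision rule, that an omission with a response time at or above the threshold is treated as ability-related (MNAR, scored 0) and a faster omission as MCAR (scored NA).

OMITTED = None   # marker used here for an omitted response in the raw data

def recode_omission(raw_response, response_time, threshold):
    """Recode a single omitted response based on its response time.

    Illustrative assumption: a response time at or above the item's threshold
    means the examinee engaged with the item, so the omission is treated as
    ability-related (MNAR) and scored 0; shorter times are treated as MCAR
    and scored 'NA' (not administered). Answered items are returned unchanged.
    """
    if raw_response is not OMITTED:
        return raw_response
    return 0 if response_time >= threshold else "NA"

# Item-specific thresholds in seconds (the study reports values between 18 and 58 s).
thresholds = {"item01": 18.0, "item02": 58.0}
print(recode_omission(OMITTED, 25.0, thresholds["item01"]))   # -> 0
print(recode_omission(OMITTED, 25.0, thresholds["item02"]))   # -> NA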
Isn't something missing? Latent variable models accounting for item nonresponse
Authors:
Köhler, Carmen
Title:
Isn't something missing? Latent variable models accounting for item nonresponse
Published:
Berlin: Freie Universität, 2017
URN:
urn:nbn:de:kobv:188-fudissthesis000000103203-8
URL:
http://www.diss.fu-berlin.de/diss/receive/FUDISS_thesis_000000103203
Document type:
1. Monographs (authorship); monograph
Language:
English
Keywords:
Empirische Forschung; Evaluation; Fehlende Daten; Item-Response-Theory; Kompetenz; Leistungsmessung; Modell; Schülerleistung; Schülerleistungstest; Statistische Methode; Testauswertung
Abstract:
Item nonresponse in competence tests poses a threat to a valid and reliable competence measurement, especially if the missing values occur systematically and relate to the unobserved response. This is often the case in the context of large-scale assessments, where the failure to respond to an item relates to examinee ability. Researchers developed methods that consider the dependency between ability and item nonresponse by incorporating a model for the process that causes missing values into the measurement model for ability. These model-based approaches seem very promising and might prove superior to common missing data approaches, which typically fail at taking the dependency between ability and nonresponse into account. Up to this point, the approaches have barely been investigated in terms of applicability and performance with regard to the scaling of competence tests in large-scale assessments. The current dissertation bridges the gap between these theoretically postulated models and their possible implementation in the context of large-scale assessments. It aims at (1) testing the applicability of model-based approaches to competence test data, and (2) evaluating whether and under what missing data conditions these approaches are superior to common missing data approaches. Three research studies were conducted for this purpose. Study 1 investigated the assumptions of model-based approaches, whether they hold in empirical practice, and how violations of those assumptions affect individual person parameters. Study 2 focused on features of examinees' nonresponse behavior, such as its stability across different competence tests and how it relates to other examinee characteristics. Study 3 examined the performance of model-based approaches compared to other approaches. Results demonstrate that model-based approaches can be applied to large-scale assessment data, though slight extensions of the models might enhance accuracy in parameter estimates. Further, persons' tendencies not to respond can be considered person-specific attributes, which are relatively constant across different competence tests and also relate to other stable person characteristics. Findings from the third study confirmed the superiority of the model-based approaches compared to common missing data approaches, although a model that simply ignores missing values also led to acceptable results. Model-based approaches show several advantages over common missing data approaches. Considering their complexity, however, the benefits and drawbacks of different methods need to be weighed. Important issues in the debate on an appropriate scaling method concern model complexity, consequences on examinees' test-taking behavior, and precision of parameter estimates. For many large-scale assessments, a change in the missing data treatment is clearly necessary. Whether model-based approaches will replace former methods is yet to be determined. They certainly count among the most advanced methods to handle missing values in the scaling of competence tests. (DIPF/Orig.)
DIPF department:
Bildungsqualität und Evaluation
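The model-based approaches discussed in the abstract above typically augment the measurement model with a latent response propensity; a generic two-part sketch (not the specific parameterisation used in the dissertation) is

    \mathrm{logit}\, P(X_{pi} = 1 \mid d_{pi} = 1) = \theta_p - \beta_i , \qquad  \mathrm{logit}\, P(d_{pi} = 1) = \xi_p - \delta_i ,

where d_{pi} indicates whether person p responded to item i, \xi_p is the person's latent response propensity, and the dependency between nonresponse and ability is captured by estimating the correlation between \theta and \xi jointly with the item parameters instead of assuming it to be zero.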
Page 2 of 7 (64 results in total)