European Bank of Anchor Items for Foreign Language Skills

The EBAFLS project (European Bank of Anchor Items for Foreign Language Skills) was part of the Lingua 2 programme launched by the European Commission, which aimed to develop a range of instruments for learning and assessing language skills. The study was carried out in eight European countries to make European language certificates comparable. The German part of the study, funded by the BMBF, was co-ordinated and conducted at DIPF.

Project description

EBAFLS aims to develop an international CEFR anchor item bank for linking national tests and exams to the CEFR. Eight countries cooperate in the project (France, Germany, Hungary, Luxembourg, the Netherlands, Scotland, Spain and Sweden). It covers the languages English, French and German and the skills reading and listening at the CEFR levels A2, B1 and B2. The item bank will be composed of reading and listening items from different exams that have been used in one of the participating countries and that have been shown to function in the same way across the participating countries.

The resulting items, which are largely free of cultural bias, are intended to show that

they are based on the same constructs
items from different countries function in a similar way with regard to level, discrimination etc.
the cut-off scores of the CEFR levels, established by means of standard-setting procedures in the participating countries, are the same for all countries

The outcome of the project, the bank of anchor items, can be used throughout Europe to link national assessment instruments to the CEFR. Language assessment will thus be transparent, reliable and valid, and each European foreign language certificate or diploma will be comparable to any other.

Background and objectives of the EBAFLS Project

In recent years, a growing number of research projects on foreign language testing have been carried out throughout the European Union. Both the EU and the European Commission have been involved in and have financed projects concerned with the assessment of foreign language comprehension. The EC has launched and supported projects such as the Language Portfolio and published the Common European Framework of Reference for Languages (EC, 2002) as well as a pilot edition of a Manual (EC, 2005) for relating tests to the CEFR. These activities, together with the multilingualism policy of the EU, have created a growing need to compare the language skills of people from different European countries in a culturally fair way.

The EBAFLS project belongs to the Socrates Lingua 2 programme of the European Commission and runs from October 2004 to September 2007. EBAFLS is based on the CEFR and follows the recommendations of the Manual in many respects, such as internal validation and standard setting (see below).

The main objective of the project is to develop CEFR-based banks of anchor items for the assessment of three foreign languages (English, French and German) and two language skills (reading and listening), giving 3 × 2 = 6 item banks. The items in the banks can be used to compare foreign language skills across different countries and thus lead to culturally fair and comparable test results. Eight European countries (France, Germany, Hungary, Luxembourg, the Netherlands, Scotland, Spain and Sweden) participate as partners in the project.

The project targeted a restricted range of the CEFR, essentially level B1. The instruments constructed in the project therefore had to discriminate between the main level (B1) and its two nearest neighbours on the CEFR scale, A2 and B2.

Countries can use the EBAFLS item banks to evaluate their own tests as an extension of the item bank, to link their national examinations to the item bank so that they become comparable to the tests and examinations of other countries and, since the items are classified by CEFR level, to link their examinations automatically to the CEFR. In this way, language certificates and diplomas become comparable across different European countries. The linking method is described in more detail further below.
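As an illustration of what linking through anchor items can look like, the following minimal sketch assumes that both the item bank and a national examination have been calibrated with a Rasch model and that the two scales differ only by a constant shift (mean-mean linking). This is an assumption made for the example; the source does not state which linking method EBAFLS uses. All item IDs and difficulty values are hypothetical.

```python
from statistics import mean


def linking_constant(bank_difficulty, national_difficulty):
    """Mean-mean linking: shift that maps the national scale onto the bank scale.

    Both arguments map item IDs to Rasch difficulty estimates (logits).
    Only items that appear in both calibrations (the anchors) are used.
    """
    common = sorted(set(bank_difficulty) & set(national_difficulty))
    if not common:
        raise ValueError("no common anchor items")
    return mean(bank_difficulty[i] - national_difficulty[i] for i in common)


def to_bank_scale(value, shift):
    """Express a national difficulty (or cut-off score) on the bank scale."""
    return value + shift


# Hypothetical calibrations: a1..a3 are anchor items, n7 is a national item.
bank = {"a1": -0.50, "a2": 0.10, "a3": 0.80}
national = {"a1": -0.20, "a2": 0.40, "a3": 1.10, "n7": 0.60}

shift = linking_constant(bank, national)
n7_on_bank_scale = to_bank_scale(national["n7"], shift)
print(f"shift = {shift:.2f}, n7 on bank scale = {n7_on_bank_scale:.2f}")
```

Once a national item is expressed on the bank scale, it can be compared with the CEFR cut-off scores defined on that scale. A constant shift is only adequate if the anchor items function in the same way in both calibrations, which is why the DIF checks described further below matter.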

The German Institute for International Educational Research (DIPF) coordinated the German part of the study. By taking part in many project meetings, DIPF was actively involved in all important project decisions and organisational tasks.

Method and Procedure

Compiling the original item pool

The items collected and validated in the course of the project came from the eight participating countries and cover reading and listening comprehension in three foreign languages (FR, DE, EN). Several criteria had to be met before an item was included in the item pool:

they should be linked to the CEFR and classified by national language testers as being at or near level B1,
they should be typical of the testing culture of the delivering country,
they should nevertheless be culture-free, i.e. not measure country-specific knowledge,
there must be some empirical evidence for the measurement properties of the items (existing data from surveys, national examinations etc.)

Classification according to the dimensions of the “Dutch Grid” from the Dutch CEFR Construct Project

Although the CEFR levels have increasingly been incorporated in assessment procedures and standards, it is by no means clear whether a specific CEFR level allocated by one school, institution or nation is based on the same criteria as the same level allocated by another. This question is not answered simply by claiming that a task or an item is to be assigned to a certain level. It can easily be demonstrated that when several experts assign items to levels independently, the inter-rater reliability of the assignments tends to be rather low.
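Agreement between independent level assignments can be quantified in several ways. The sketch below uses Cohen's kappa for two experts, which corrects the observed proportion of agreement for the agreement expected by chance. The choice of statistic and the ratings are assumptions made for illustration; the source does not say how agreement was measured.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who each assign one category per item."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Proportion of items on which both raters chose the same level.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal distribution.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical CEFR level assignments by two experts for eight items.
expert_1 = ["A2", "B1", "B1", "B2", "B1", "A2", "B2", "B1"]
expert_2 = ["B1", "B1", "B2", "B2", "A2", "A2", "B1", "B1"]

kappa = cohens_kappa(expert_1, expert_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the experts agree on half of the items, but after chance correction kappa is only about 0.2, which illustrates how raw agreement can overstate reliability.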

Background of such problems:

The development of the CEFR was not theoretically driven. It therefore contains gaps, inconsistencies and biases that cause confusion when test specifications are defined.
Many descriptors rely on terminology that is not precise: do similar descriptors mean different things, or are they stylistic variants?
In the descriptions of the level profiles it is not always clear what makes a task easier or more difficult.

Earlier work had concluded that the CEFR is not detailed enough to link foreign language test items precisely to its levels. The Dutch CEFR Construct Project was therefore established (Alderson et al., 2006), http://www.lancs.ac.uk/fss/projects/grid/. DIPF was represented in the international expert group by Günter Nold, professor of applied linguistics.

The Dutch CEFR Construct Project investigated whether the CEFR can help test developers construct reading and listening tests based on CEFR levels. The results revealed that the CEFR scales, together with the detailed description of language use contained in the CEFR, are not sufficient to guide test development at these various levels. The project methodology involved gathering expert judgments on the usability of the CEFR for test construction, identifying what might be missing from the CEFR, developing a frame for the analysis of tests and specifications, and examining a range of existing test specifications, guidelines to item writers and sample test tasks for different languages at the six levels of the CEFR. Outcomes included a critical review of the CEFR, a set of compilations of CEFR scales and of test specifications at the different CEFR levels, and a series of frameworks or classification systems, which led to a Web-mounted instrument known as the Dutch CEFR Grid. Inter-analyst agreement in using the Grid for analyzing test tasks was quite promising, but the Grid needs to be improved by training and discussion before decisions on test task levels are made. However, identifying separate CEFR levels is at least as much an empirical matter as it is a question of test content, whether determined by test specifications or identified by any content classification system or grid.

The Dutch Grid is a tool for specifying items and tasks according to

text source (personal, public, occupational, education and training)
authenticity (genuine, adapted/simplified, pedagogic)
text type/discourse type (descriptive, narrative, expository, argumentative, instructive)
domains (personal, public, work, education)
communication topics (personal identification, house, home environment, daily life, free time, entertainment, travel, relations with other people, health and body care, education and training, shopping, food and drink, services, language, weather)
degree of abstraction of texts / nature of content
vocabulary (from only frequent vocabulary to extended vocabulary)
grammar (from only frequent grammar to extended grammar)
text length

The Dutch Grid identifies the question types as:

selected response (MC, True-false, combination, sequencing, citing)
short structured response (short answer, cloze, gap-filling, complete a question, complete a summary)
extended constructed response (essay, summary, justify in own words)

and categorizes items concerning the operations to be applied by the testee:

task dimension: recognize and retrieve, make inferences, evaluate
explicitness dimension: explicit, implicit
content dimension: main ideas, gist / broad outlines, details, opinion, attitude of the author, conclusions, communicative objective, text structure / relations between text parts.

Thus, with the help of the Dutch Grid, items can be described with regard to certain item and task properties. This makes it easier to classify items into the levels of the CEFR and to be more specific about their assumed difficulties.
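The Grid dimensions listed above can be thought of as a structured record per item. The following sketch shows one possible representation; the class and field names are hypothetical and are not taken from the Grid itself or from any EBAFLS software.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    SELECTED_RESPONSE = "selected response"
    SHORT_STRUCTURED_RESPONSE = "short structured response"
    EXTENDED_CONSTRUCTED_RESPONSE = "extended constructed response"


class Operation(Enum):
    RECOGNIZE_AND_RETRIEVE = "recognize and retrieve"
    MAKE_INFERENCES = "make inferences"
    EVALUATE = "evaluate"


@dataclass(frozen=True)
class GridItem:
    """One test item described along a subset of the Grid dimensions."""
    item_id: str          # hypothetical identifier
    skill: str            # "reading" or "listening"
    text_source: str      # personal, public, occupational, education and training
    authenticity: str     # genuine, adapted/simplified, pedagogic
    discourse_type: str   # descriptive, narrative, expository, ...
    question_type: QuestionType
    operation: Operation
    explicit: bool        # explicitness dimension: explicit vs. implicit
    content_focus: str    # main ideas, gist, details, ...
    assumed_level: str    # CEFR level assigned by the classifier


def select(items, **criteria):
    """Return the items whose fields equal all given criteria."""
    return [it for it in items
            if all(getattr(it, name) == value for name, value in criteria.items())]


# Two made-up items, for illustration only.
items = [
    GridItem("R-001", "reading", "public", "genuine", "descriptive",
             QuestionType.SELECTED_RESPONSE, Operation.RECOGNIZE_AND_RETRIEVE,
             True, "details", "A2"),
    GridItem("R-002", "reading", "personal", "adapted/simplified", "narrative",
             QuestionType.SHORT_STRUCTURED_RESPONSE, Operation.MAKE_INFERENCES,
             False, "main ideas", "B1"),
]

b1_inference = select(items, assumed_level="B1", operation=Operation.MAKE_INFERENCES)
print([it.item_id for it in b1_inference])
```

Describing items in a fixed structure like this makes it straightforward to check whether items assigned to the same level share comparable task properties.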

Finally, before being accepted into the EBAFLS item pool, the items were classified into the CEFR levels and checked again by native speakers for “formal” mistakes (such as grammar or spelling mistakes).

Testing the items

The items for each of the three languages were tested in the participating countries. Each country tested items for two of the three languages. Reading comprehension was assessed in May/June 2006 and listening comprehension in February 2007. First results are expected by the end of September 2007 in the final project report. The operationalization of the tests and the study design follow the conventions of international large-scale studies. The study targets students who are approximately at level B1 of the CEFR, which is estimated to be the level of competence students should have reached by the end of compulsory education. Depending on country and language, the students were enrolled in grades 9 to 11.

DIF and dealing with it in EBAFLS

To find out whether an item is culturally biased, the items were checked for “Differential Item Functioning” (DIF), which can be defined as follows: DIF occurs when people from different groups (commonly gender or ethnicity) with the same latent trait (the same ability/skill) have a different probability of giving a certain response on a questionnaire or test. DIF analysis explores the amount of DIF for each item of a test. “An item (…) displays DIF if people from different groups in spite of their same underlying true ability have a different probability to give a certain response.” (Embretson & Reise, 2000).
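One widely used way to quantify DIF is the Mantel-Haenszel procedure: test takers are grouped into strata of comparable ability (for example by total score), and within each stratum the odds of a correct answer are compared between a reference group and a focal group. The sketch below illustrates that procedure under the assumption of dichotomously scored items; the source does not state which DIF method EBAFLS applied, and all counts are hypothetical.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across ability strata.

    Each stratum is a tuple
    (reference_correct, reference_wrong, focal_correct, focal_wrong).
    A value of 1.0 means no DIF; values above 1.0 mean the item is easier
    for the reference group at equal ability.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in strata:
        n = ref_right + ref_wrong + foc_right + foc_wrong
        if n == 0:
            continue  # empty stratum contributes nothing
        numerator += ref_right * foc_wrong / n
        denominator += ref_wrong * foc_right / n
    if denominator == 0:
        raise ValueError("odds ratio undefined for these counts")
    return numerator / denominator


def mh_d_dif(odds_ratio):
    """Odds ratio on the ETS delta scale (negative: item favours the reference group)."""
    return -2.35 * math.log(odds_ratio)


# Hypothetical counts for one item in two ability strata.
item_without_dif = [(20, 20, 10, 10), (30, 10, 15, 5)]
item_with_dif = [(30, 10, 10, 10), (36, 4, 15, 5)]

alpha_fair = mantel_haenszel_odds_ratio(item_without_dif)
alpha_biased = mantel_haenszel_odds_ratio(item_with_dif)
print(f"no DIF: alpha = {alpha_fair:.2f}")
print(f"DIF:    alpha = {alpha_biased:.2f}, D-DIF = {mh_d_dif(alpha_biased):.2f}")
```

In the EBAFLS setting the groups would be the participating countries rather than gender or ethnicity, so an item flagged in this way would be a candidate for cultural bias.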

Standard Setting

Standard setting is one of the steps recommended in the Manual for linking tests to the CEFR. It is an important part of the validation process. There are two kinds of standard setting: item-centred and examinee-centred. For item-centred standard setting, language experts in every participating country are trained on the scales of the CEFR and asked to judge the CEFR level of all items. The difficulty of the items as judged by the experts is then compared with the item difficulty found empirically in the study. In addition, it is checked whether experts from the different countries assign the items to the same CEFR level. For examinee-centred standard setting, teachers are asked to assign students to CEFR levels. Those results are then compared with the empirical results of the study.
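The comparison between judged and empirical item difficulty can be summarized, for example, with a rank correlation. The sketch below computes Spearman's rho between expert-assigned CEFR levels and empirical difficulty estimates, using average ranks for ties. The statistic is an assumption chosen for illustration, not a method confirmed by the source, and the data are hypothetical.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


LEVEL_ORDER = {"A2": 1, "B1": 2, "B2": 3}

# Hypothetical data: expert-judged level and empirical difficulty per item.
judged_levels = ["A2", "A2", "B1", "B1", "B2", "B2"]
empirical_difficulty = [-1.2, -0.7, -0.1, 0.3, 0.9, 1.4]

rho = spearman([LEVEL_ORDER[level] for level in judged_levels], empirical_difficulty)
print(f"rho = {rho:.3f}")
```

A high rho indicates that items judged to be at higher levels also turned out to be empirically harder; a low value would call either the judgments or the items into question.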

Outcome

The project aims at the following concrete outcomes:

Six item banks, one per language and skill (English reading and listening, French reading and listening, German reading and listening)
A manual for institutions describing the use and function of the item bank
A report on the validity and reliability of the items and instruments, with guidelines on how to use and enlarge the item banks and on how to develop additional item banks.

Publications

Alderson, C. (Ed.) (2005). Language Assessment in Europe (Special Issue). Language Testing, 22 (3).
Alderson, J. C., Figueras, N., Kuijper, H., Nold, G., Takala, S., & Tardieu, C. (2006). Analysing Tests of Reading and Listening in Relation to the Common European Framework of Reference: The Experience of the Dutch CEFR Construct Project. Language Assessment Quarterly, 3 (1), 3-30.
Europäische Kommission (2003). Mitteilung der Kommission an den Rat, das Europäische Parlament, den Wirtschafts- und Sozialausschuss und den Ausschuss der Regionen vom 24. Juli 2003 – Förderung des Sprachenlernens und der Sprachenvielfalt: Aktionsplan 2004 – 2006. Komm(2003)449 endgültig.
Europäische Kommission (2005). Mitteilung der Kommission vom 1. August 2005 - Europäischer Indikator für Sprachenkompetenz [KOM(2005) 356 endg. - nicht im Amtsblatt veröffentlicht].
Europäischer Rat (2002). Tagung des Europäischen Rates von Barcelona, 15. und 16. März 2002, Schlussfolgerungen des Vorsitzes, Absatz 44. Retrieved 20.03.07 from http://www.bologna-berlin2003.de/pdf/Schluss_Rat_Barcelona.pdf.
Europarat (2001). Gemeinsamer Europäischer Referenzrahmen für Sprachen: lernen, lehren, beurteilen. Berlin: Langenscheidt.
Figueras, N., North, B., Takala, S., Verhelst, N., & Van Avermaet, P. (2005). Relating Examinations to the Common European Framework: a manual. Language Testing, 22 (3), 257-261.
Maris, G. (2005). EBAFLS: Results from the pilot study. Unpublished manuscript, CITO.
Noijons, J. & Kuijper, H. (2006). Report of a research project commissioned by the Dutch Ministry of Education, Culture and Science. Unpublished manuscript, CITO.
Special Eurobarometer 63.4 (2005). Europeans and their languages. TNS opinion & social. Retrieved 20.03.07 from http://ec.europa.eu/public_opinion/archives/ebs/ebs_237.en.pdf.
Holland, P. & Wainer, H. (Eds.) (1993). Differential Item Functioning. Hillsdale, New Jersey: Lawrence Erlbaum.

Project Details

Status:	Completed Projects
Department:	Teacher and Teaching Quality
Duration:	2004 – 2007
Funding:	External funding