Information Extraction From Spoken and Informal Language

This project examines various aspects of extracting and using information from data that contains spoken or informal language. These include classification (What makes a good answer?), segmentation and keyphrase extraction (in the context of transcripts of school lessons), and summarization of data for the Eduserver. In each subproject, either the data comes from the educational domain or the goal is to provide information or tools to researchers in educational research.

Automatic information extraction from transcribed video data

The Pythagoras Dataset contains transcribed speech data from over 100 school lessons dealing with the Pythagorean theorem. This makes it possible to examine available Natural Language Processing (NLP) methods and evaluate their performance on this type of data. Spoken language is difficult to process automatically because it differs substantially from the written text on which the methods were trained and for which they were developed. One major difference is that sentences are very often ungrammatical and filled with hesitations and pauses (disfluencies). Another problematic phenomenon is crosstalk, where speakers interrupt each other, leaving sentences incomplete, or finish each other’s sentences. Nevertheless, we want to evaluate how well available methods perform on this type of data. Our current focus is the extraction of keywords and key phrases. If successful, this would allow the underlying data to be searched. Another issue is the segmentation of lessons into parts based on the interaction in the classroom and the situation during the lesson. If successful, this could be used to support the classification of classroom situations as detailed in Klieme (2006).
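To illustrate what keyword extraction on disfluent transcripts can look like, the following is a minimal sketch that drops filler words and ranks the remaining tokens by TF-IDF. It is not the method used in the project; the filler list, the function names and the sample sentences are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical filler list; a real one depends on the transcription conventions.
FILLERS = {"uh", "um", "er", "hm"}


def tokenize(text):
    """Lowercase the text, keep runs of letters, and drop filler words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in FILLERS]


def top_keywords(documents, doc_index, k=3):
    """Rank the tokens of one document by TF-IDF against the whole collection."""
    tokenized = [tokenize(d) for d in documents]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    n_docs = len(documents)
    term_freq = Counter(tokenized[doc_index])
    scores = {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Invented sample utterances, not taken from the Pythagoras Dataset.
segments = [
    "um the hypotenuse uh the hypotenuse is the longest side",
    "the teacher writes on the board",
    "the students open the book",
]
best = top_keywords(segments, 0, k=1)
```

In this toy collection, words that occur in every segment receive a score of zero, so the term repeated within a single segment ranks highest. Real transcripts would additionally need handling of incomplete sentences and crosstalk, which this sketch ignores.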

What makes a good answer?

Forum posts are another type of informal language. Here, the language tends to be filled with abbreviations, character combinations, etc. that are not used elsewhere. Using the freely available data from Stack Exchange, we want to explore how good answers can be detected automatically. Stack Exchange has a lively community in which both answers and questions are voted on. This data allows us to build a system that can learn what characterizes a good answer, not only as a binary decision but also on a graded scale. We hope to apply this to projects in the educational domain, for example in the infoblog. The challenges lie both in the domain of the data and in the language. Stack Exchange offers forums on a variety of topics, but the largest by far is Stack Overflow, which covers questions and answers from the domain of programming. The data is therefore filled with code snippets and domain-specific vocabulary. We first aim to find methods that are good at predicting good answers in this domain. To this end, we will explore a wide range of features, many of which are probably unique to this domain. In a subsequent step, we want to explore features that can be applied to other domains. Stack Exchange is well suited for this, as it also offers forums for groups interested in, for example, cooking, traveling, English language learning and physics.
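As a rough illustration of the kind of features such a system might start from, the sketch below derives a few surface features from an answer body. The feature set and the function name are hypothetical and not taken from the project, and the code assumes that answer bodies are available as HTML in which code snippets are wrapped in `<pre><code>` elements.

```python
import re

# Assumption: code snippets appear as <pre ...><code>...</code></pre> in the body.
CODE_BLOCK = re.compile(r"<pre[^>]*><code>.*?</code></pre>", flags=re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")


def answer_features(body_html):
    """Return a small dictionary of illustrative surface features."""
    code_blocks = CODE_BLOCK.findall(body_html)
    # Remove code first so that only the prose is counted as words.
    prose = CODE_BLOCK.sub(" ", body_html)
    prose = ANY_TAG.sub(" ", prose)
    return {
        "n_words": len(prose.split()),
        "n_code_blocks": len(code_blocks),
        "has_link": "<a href=" in body_html,
    }


# Invented example body, not taken from the Stack Exchange data.
sample = (
    '<p>Use <a href="https://example.org">this</a> approach:</p>'
    "<pre><code>print(1)\n</code></pre>"
)
features = answer_features(sample)
```

A feature such as the number of code blocks is specific to the programming domain, whereas the word count or the presence of a link could carry over to other forums. The vote score of an answer could then serve as a graded target rather than a binary label.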

Summarization of data for the Eduserver

Another aspect in this context is the summarization of data. The German Eduserver hosts a large number of manually curated links and information from the domain of educational science that are relevant to both researchers and practitioners. Human curators prepare the summaries for linked sources, such as websites, books and articles. This is a time-consuming and tedious task. These summaries should guide users of the Eduserver to links that are of interest to them. The quality of the background material varies considerably, as the data comes from various sources. Using NLP methods, we aim to create a framework that suggests potential summaries to support the human curator. Initial results obtained through a master’s thesis at the Technical University of Darmstadt in collaboration with DIPF indicate that it is feasible to provide the human curators with sentences to use in a final summary and that this is indeed helpful in creating it. As these are only initial results, we want to build on the framework developed in the course of the master’s thesis to improve its quality, but also to compare our system against other systems using standard evaluation metrics. Here, we can make use of the NLP community's long-standing tradition of shared-task competitions. One of these was the Document Understanding Conference (DUC), which later became the Text Analysis Conference (TAC). Through these competitions, large data sets for various tasks (single-document summarization, multi-document summarization, short and very short summaries, but also update summaries) are available, along with reference data against which the results of a specific system can be compared. These competitions also helped to develop standard evaluation tools, which we will use to evaluate our system.
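The text does not name a specific evaluation metric. One metric widely used in these summarization evaluations is ROUGE, and the sketch below shows a simplified ROUGE-1 recall, i.e. the share of reference unigrams that also occur in the system summary. The function name and the whitespace tokenization are simplifying assumptions; the official tooling offers further options such as stemming and stop-word removal.

```python
from collections import Counter


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also occur in the candidate summary."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Clip each match at the number of times the token occurs in the candidate.
    overlap = sum(min(count, cand_counts[token]) for token, count in ref_counts.items())
    return overlap / total


# Invented example sentences for illustration only.
reference_summary = "the portal lists curated links for teachers"
system_summary = "the portal lists links"
score = rouge1_recall(system_summary, reference_summary)
```

Here four of the seven reference tokens are covered by the system summary. In practice a system would be scored against several human reference summaries per document.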


UKP TU Darmstadt


The project is funded by the general DIPF budget.

Project Details

Completed Projects
Department: Information Centre for Education
2013 – 2015