Authors:
Goldhahn, Dirk; Remus, Steffen; Quasthoff, Uwe; Biemann, Chris

Title:
Top-level domain crawling for producing comprehensive monolingual corpora from the web

Source:
In: Kupietz, Marc; Biber, Hanno; Lüngen, Harald; Banski, Piotr; Breiteneder, Evelyn; Mörth, Karlheinz; Witt, Andreas; Takhsha, Jani (eds.): Proceedings of the LREC-14 Workshop on Challenges in the management of Large Corpora (CMLC-2). Reykjavik: European Language Resources Association (2014), pp. 10-14

Full-text URL:
http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-CMLC2%20Proceedings-rev2.pdf

Language:
English

Document type:
4. Contributions to edited volumes; conference volume/conference paper/proceedings

Keywords:
Computational linguistics, Internet, Language, Text, Tool, Vocabulary


Abstract (original):
This paper describes crawling and corpus processing in a distributed framework. We present new tools that build upon existing tools like Heritrix and Hadoop. Further, we propose a general workflow for harvesting, cleaning and processing web data from entire top-level domains in order to produce high-quality monolingual corpora using the least amount of language-specific data. We demonstrate the utility of the infrastructure by producing corpora for two under-resourced languages. Web corpus production for targeted languages and/or domains thus becomes feasible for anyone. (DIPF/Orig.)


DIPF department:
Informationszentrum Bildung (Information Center for Education)

Last modified: 11.11.2016