Authors: Goldhahn, Dirk; Remus, Steffen; Quasthoff, Uwe; Biemann, Chris
Title: Top-level domain crawling for producing comprehensive monolingual corpora from the web
In: Kupietz, Marc; Biber, Hanno; Lüngen, Harald; Banski, Piotr; Breiteneder, Evelyn; Mörth, Karlheinz; Witt, Andreas; Takhsha, Jani (eds.): Proceedings of the LREC-14 Workshop on Challenges in the management of Large Corpora (CMLC-2), Reykjavik: European Language Resources Association, 2014, pp. 10-14
URL: http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-CMLC2%20Proceedings-rev2.pdf
Document type: 4. Contributions to edited volumes; conference volume/conference paper/proceedings
Language: English
Keywords: Computational linguistics; Internet; Language; Text; Tool; Vocabulary
Abstract: This paper describes crawling and corpus processing in a distributed framework. We present new tools that build upon existing tools like Heritrix and Hadoop. Further, we propose a general workflow for harvesting, cleaning and processing web data from entire top-level domains in order to produce high-quality monolingual corpora using the least amount of language-specific data. We demonstrate the utility of the infrastructure by producing corpora for two under-resourced languages. Web corpus production for targeted languages and/or domains thus becomes feasible for anyone. (DIPF/Orig.)
DIPF department: Informationszentrum Bildung (Information Center for Education)