Top-level domain crawling for producing comprehensive monolingual corpora from the web
This paper describes crawling and corpus processing in a distributed framework.
We present new tools that build upon existing tools such as Heritrix and Hadoop. Further, we propose a general workflow for harvesting, cleaning and processing web data from entire top-level domains in order to produce high-quality monolingual corpora while relying on as little language-specific data as possible. We demonstrate the utility of the infrastructure by producing corpora for two under-resourced languages. Web corpus production for targeted languages and/or domains thus becomes feasible for anyone.
Goldhahn, Dirk; Remus, Steffen; Quasthoff, Uwe; Biemann, Chris: Top-level domain crawling for producing comprehensive monolingual corpora from the web, in: Kupietz, Marc; Biber, Hanno; Lüngen, Harald; Banski, Piotr; Breiteneder, Evelyn; Mörth, Karlheinz; Witt, Andreas; Takhsha, Jani (eds.): Proceedings of the LREC-14 Workshop on Challenges in the Management of Large Corpora (CMLC-2). Reykjavik, Iceland: European Language Resources Association (ELRA) (2014), pp. 10-14.