Search results list

Your query:

id: 37066

Domain-specific corpus expansion with focused webcrawling. Remus, Steffen; Biemann, Chris. Contribution to an edited volume | In: European Language Resources Association (ed.): Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016) | Portoroz: European Language Resources Association | 2016

Authors: Remus, Steffen; Biemann, Chris
Title: Domain-specific corpus expansion with focused webcrawling
In: European Language Resources Association (ed.): Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portoroz: European Language Resources Association, 2016, pp. 3607-3611
URL: http://www.lrec-conf.org/proceedings/lrec2016/pdf/316_Paper.pdf
Document type: 4. Contributions to edited volumes; conference volume/conference paper/proceedings
Language: English
Keywords: Algorithm; Automation; Education; Computational linguistics; Data mining; Hypertext; Model; Language; Text; Text analysis
Abstract: This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the relatedness of documents and weblinks and needs as input only N-grams or plain texts of a predefined domain and seed URLs as starting points. Two experiments demonstrate that our focused crawler is able to stay focused in domain and language. The first experiment shows that the crawler stays in a focused domain, the second experiment demonstrates that language models trained on focused crawls obtain better perplexity scores on in-domain corpora. We distribute the focused crawler as open source software. (DIPF/Orig.)
DIPF department: Informationszentrum Bildung (Information Center for Education)
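
Illustrative sketch: the abstract states that the crawler uses statistical N-gram language models to estimate how related documents are to a predefined domain, and that quality is measured by perplexity. The record does not say how the scores are computed, so the following is only a minimal sketch under stated assumptions: a bigram model with add-one smoothing, whitespace tokenization, and hypothetical names (`BigramModel`, `rank_by_relatedness`) and example URLs. It is not the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


class BigramModel:
    """Bigram language model with add-one smoothing (assumed, not from the paper)."""

    def __init__(self, domain_texts):
        self.contexts = Counter()  # how often each token occurs as a left context
        self.bigrams = Counter()   # how often each (previous, current) pair occurs
        vocab = set()
        for text in domain_texts:
            tokens = ["<s>"] + tokenize(text)
            vocab.update(tokens[1:])
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        # One extra slot stands in for any word not seen in the domain texts.
        self.vocab_size = len(vocab) + 1

    def perplexity(self, text):
        """Lower perplexity means the text looks more like the domain texts."""
        tokens = ["<s>"] + tokenize(text)
        if len(tokens) < 2:
            return float("inf")  # empty text carries no evidence
        log_prob = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            count = self.bigrams[(prev, word)] + 1
            total = self.contexts[prev] + self.vocab_size
            log_prob += math.log(count / total)
        return math.exp(-log_prob / (len(tokens) - 1))


def rank_by_relatedness(model, candidates):
    """Sort (url, text) pairs so the most domain-like text comes first."""
    return sorted(candidates, key=lambda item: model.perplexity(item[1]))


# Hypothetical domain texts and candidate pages, for illustration only.
model = BigramModel([
    "the crawler downloads web pages and follows links",
    "language models score web pages by perplexity",
])
candidates = [
    ("http://example.org/fruit", "bananas grow in tropical climates"),
    ("http://example.org/crawl", "the crawler follows links"),
]
ranked = rank_by_relatedness(model, candidates)
for url, text in ranked:
    print(url, round(model.perplexity(text), 2))
```

In a focused crawler, a ranking of this kind could decide which pages to keep and which outgoing links to visit next; a real system would need a much larger domain corpus and a more careful smoothing method.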