UN CORPUS DELLA STAMPA ITALIANA LOCALE
DOI:
https://doi.org/10.13135/2384-8987/3382
Keywords:
Corpus design, Italian, local press
Abstract
A corpus of the Italian local press. This paper introduces CoSIL, a corpus of articles from Italian local newspapers containing about 180,000 texts and 66,000,000 words. The corpus was built to give researchers a freely downloadable, balanced corpus of journalistic texts and a resource for linguistic research on the online local press, which is now a pervasive source of information. In addition to the objectives behind its construction, the paper describes the design and development of the corpus, focusing on its representativeness and balance.
License
RiCognizioni is published under a Creative Commons Attribution 4.0 International License.
Under the CC-BY licence, authors retain copyright while allowing anyone to download, reuse, reprint, modify, distribute, and/or copy their contribution, provided the work is properly attributed to its author.
No further permission is required from either the author or the journal board.