Modeling language change in historical corpora: the case of Portuguese

Marcos Zampieri, Shervin Malmasi, Mark Dras

Research output: Chapter in Book/Report/Conference proceeding > Conference proceeding contribution > Research > peer-review

Abstract

This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts with respect to their publication date taking into account lexical variation represented as word n-grams, and morphosyntactic variation represented by part-of-speech (POS) distribution.
We report results of 99.8% accuracy using word unigram features with a Support Vector Machines classifier to predict the publication date of documents in time intervals of both one century and half a century. A feature analysis is performed to investigate the most informative features for this task and how they are linked to language change.
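The two ingredients named in the abstract, word n-gram features and publication-date classes of one century or half a century, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `word_ngrams` and `interval_label`, the whitespace tokenisation, the convention of labelling an interval by its first year, and the sample sentence are all assumptions made for illustration, and the Support Vector Machines training step is left out.

```python
from collections import Counter


def word_ngrams(text, n=1):
    """Count lowercase word n-grams in a text.

    Assumption: tokens are split on whitespace; the paper's own
    tokenisation is not described in the abstract.
    """
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def interval_label(year, width=100):
    """Map a publication year to the first year of its time interval.

    width=100 gives one-century classes, width=50 half-century classes.
    Assumption: intervals start at multiples of `width`.
    """
    return (year // width) * width


# Hypothetical toy sentence, not taken from the paper's corpus.
sample = "o rei mandou chamar o conde"
unigrams = word_ngrams(sample, n=1)
bigrams = word_ngrams(sample, n=2)

print(unigrams[("o",)])           # frequency of the unigram "o"
print(interval_label(1572, 100))  # century class for the year 1572
print(interval_label(1572, 50))   # half-century class for the year 1572
```

Counts like these would form the feature vector for each document, and the interval label would be the class a classifier is trained to predict.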
Language: English
Title of host publication: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis
Place of Publication: Paris, France
Publisher: European Language Resources Association (ELRA)
Pages: 4098-4104
Number of pages: 7
ISBN (Print): 9782951740891
Publication status: Published - 2016
Event: International Conference on Language Resources and Evaluation (10th : 2016) - Portorož, Slovenia
Duration: 23 May 2016 to 28 May 2016

Conference

Conference: International Conference on Language Resources and Evaluation (10th : 2016)
Abbreviated title: LREC 2016
Country: Slovenia
City: Portorož
Period: 23/05/16 to 28/05/16

Fingerprint

Support vector machines
Classifiers
Experiments
Modeling languages

Bibliographical note

Copyright the Author(s) 2016. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

Keywords

  • Language Change
  • Temporal Text Classification
  • Support Vector Machines
  • Text Categorization

Cite this

Zampieri, M., Malmasi, S., & Dras, M. (2016). Modeling language change in historical corpora: the case of Portuguese. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, ... S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (pp. 4098-4104). [706] Paris, France: European Language Resources Association (ELRA).
Zampieri, Marcos ; Malmasi, Shervin ; Dras, Mark. / Modeling language change in historical corpora : the case of Portuguese. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). editor / Nicoletta Calzolari ; Khalid Choukri ; Thierry Declerck ; Sara Goggi ; Marko Grobelnik ; Bente Maegaard ; Joseph Mariani ; Hélène Mazo ; Asunción Moreno ; Jan Odijk ; Stelios Piperidis. Paris, France : European Language Resources Association (ELRA), 2016. pp. 4098-4104
@inproceedings{3477bfd0266e445f959afcb8b1d11fd4,
title = "Modeling language change in historical corpora: the case of Portuguese",
abstract = "This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts with respect to their publication date taking into account lexical variation represented as word n-grams, and morphosyntactic variation represented by part-of-speech (POS) distribution. We report results of 99.8{\%} accuracy using word unigram features with a Support Vector Machines classifier to predict the publication date of documents in time intervals of both one century and half a century. A feature analysis is performed to investigate the most informative features for this task and how they are linked to language change.",
keywords = "Language Change, Temporal Text Classification, Support Vector Machines, Text Categorization",
author = "Marcos Zampieri and Shervin Malmasi and Mark Dras",
note = "Copyright the Author(s) 2016. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.",
year = "2016",
language = "English",
isbn = "9782951740891",
pages = "4098--4104",
editor = "Nicoletta Calzolari and Khalid Choukri and Thierry Declerck and Sara Goggi and Marko Grobelnik and Bente Maegaard and Joseph Mariani and H{\'e}l{\`e}ne Mazo and Asunci{\'o}n Moreno and Jan Odijk and Stelios Piperidis",
booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)",
publisher = "European Language Resources Association (ELRA)",

}

Zampieri, M, Malmasi, S & Dras, M 2016, Modeling language change in historical corpora: the case of Portuguese. in N Calzolari, K Choukri, T Declerck, S Goggi, M Grobelnik, B Maegaard, J Mariani, H Mazo, A Moreno, J Odijk & S Piperidis (eds), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)., 706, European Language Resources Association (ELRA), Paris, France, pp. 4098-4104, International Conference on Language Resources and Evaluation (10th : 2016), Portorož, Slovenia, 23/05/16.

Modeling language change in historical corpora : the case of Portuguese. / Zampieri, Marcos; Malmasi, Shervin; Dras, Mark.

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). ed. / Nicoletta Calzolari; Khalid Choukri; Thierry Declerck; Sara Goggi; Marko Grobelnik; Bente Maegaard; Joseph Mariani; Hélène Mazo; Asunción Moreno; Jan Odijk; Stelios Piperidis. Paris, France : European Language Resources Association (ELRA), 2016. p. 4098-4104 706.

TY - GEN

T1 - Modeling language change in historical corpora

T2 - the case of Portuguese

AU - Zampieri, Marcos

AU - Malmasi, Shervin

AU - Dras, Mark

N1 - Copyright the Author(s) 2016. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

PY - 2016

Y1 - 2016

AB - This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts with respect to their publication date taking into account lexical variation represented as word n-grams, and morphosyntactic variation represented by part-of-speech (POS) distribution. We report results of 99.8% accuracy using word unigram features with a Support Vector Machines classifier to predict the publication date of documents in time intervals of both one century and half a century. A feature analysis is performed to investigate the most informative features for this task and how they are linked to language change.

KW - Language Change

KW - Temporal Text Classification

KW - Support Vector Machines

KW - Text Categorization

UR - http://www.lrec-conf.org/proceedings/lrec2016/index.html

UR - http://www.scopus.com/inward/record.url?scp=85008357029&partnerID=8YFLogxK

M3 - Conference proceeding contribution

SN - 9782951740891

SP - 4098

EP - 4104

BT - Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

A2 - Calzolari, Nicoletta

A2 - Choukri, Khalid

A2 - Declerck, Thierry

A2 - Goggi, Sara

A2 - Grobelnik, Marko

A2 - Maegaard, Bente

A2 - Mariani, Joseph

A2 - Mazo, Hélène

A2 - Moreno, Asunción

A2 - Odijk, Jan

A2 - Piperidis, Stelios

PB - European Language Resources Association (ELRA)

CY - Paris, France

ER -

Zampieri M, Malmasi S, Dras M. Modeling language change in historical corpora: the case of Portuguese. In Calzolari N, Choukri K, Declerck T, Goggi S, Grobelnik M, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Paris, France: European Language Resources Association (ELRA). 2016. p. 4098-4104. 706