Modeling language change in historical corpora: the case of Portuguese

Marcos Zampieri, Shervin Malmasi, Mark Dras

Research output: Chapter in Book/Report/Conference proceeding > Conference proceeding contribution


Abstract

This paper presents a number of experiments to model changes in a historical Portuguese corpus composed of literary texts for the purpose of temporal text classification. Algorithms were trained to classify texts with respect to their publication date taking into account lexical variation represented as word n-grams, and morphosyntactic variation represented by part-of-speech (POS) distribution.
We report 99.8% accuracy using word unigram features with a Support Vector Machine (SVM) classifier to predict the publication date of documents in time intervals of both one century and half a century. A feature analysis investigates which features are most informative for this task and how they relate to language change.
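The setup described in the abstract (word unigram counts fed to a linear SVM that predicts a time interval) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the four training sentences are invented toy examples rather than corpus data, the century labels are placeholders, and the helper names (`unigram_counts`, `train_linear_svm`, `predict`) as well as the hyperparameters are assumptions chosen for readability. The SVM is a plain one-vs-rest hinge-loss model trained by sub-gradient descent, written with the standard library only.

```python
from collections import Counter


def unigram_counts(text):
    """Word unigram features: lower-cased, whitespace-tokenised counts."""
    return Counter(text.lower().split())


def train_linear_svm(samples, labels, epochs=50, lam=0.01, lr=0.1):
    """One-vs-rest linear SVM trained by sub-gradient descent on the
    L2-regularised hinge loss. Hyperparameters are illustrative guesses."""
    classes = sorted(set(labels))
    weights = {c: {} for c in classes}  # sparse weight vector per class
    bias = {c: 0.0 for c in classes}
    for _ in range(epochs):
        for feats, label in zip(samples, labels):
            for c in classes:
                y = 1.0 if label == c else -1.0
                w = weights[c]
                score = bias[c] + sum(w.get(f, 0.0) * v for f, v in feats.items())
                # Shrink step coming from the L2 regulariser.
                for f in w:
                    w[f] *= 1.0 - lr * lam
                # Hinge-loss step: only examples inside the margin move w.
                if y * score < 1.0:
                    for f, v in feats.items():
                        w[f] = w.get(f, 0.0) + lr * y * v
                    bias[c] += lr * y
    return weights, bias


def predict(model, feats):
    """Return the class whose linear score is highest."""
    weights, bias = model

    def score(c):
        return bias[c] + sum(weights[c].get(f, 0.0) * v for f, v in feats.items())

    return max(weights, key=score)


# Invented toy sentences (NOT from the paper's corpus) contrasting older
# and modern Portuguese spellings; labels are hypothetical.
train = [
    ("a pharmacia he hum logar antigo", "19th century"),
    ("elle foi á pharmacia hontem", "19th century"),
    ("a farmácia é um lugar antigo", "20th century"),
    ("ele foi à farmácia ontem", "20th century"),
]
samples = [unigram_counts(text) for text, _ in train]
labels = [label for _, label in train]
model = train_linear_svm(samples, labels)

print(predict(model, unigram_counts("hontem elle vio hum logar")))
print(predict(model, unigram_counts("ontem ele viu um lugar")))
```

In a linear model of this kind, the learned per-class weights can be sorted to list the unigrams that most strongly indicate each period, which is one simple way a feature analysis like the one mentioned in the abstract could be approached.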
Original language: English
Title of host publication: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis
Place of Publication: Paris, France
Publisher: European Language Resources Association (ELRA)
Pages: 4098-4104
Number of pages: 7
ISBN (Print): 9782951740891
Publication status: Published - 2016
Event: International Conference on Language Resources and Evaluation (10th : 2016) - Portorož, Slovenia
Duration: 23 May 2016 to 28 May 2016

Conference

Conference: International Conference on Language Resources and Evaluation (10th : 2016)
Abbreviated title: LREC 2016
Country: Slovenia
City: Portorož
Period: 23/05/16 to 28/05/16

Bibliographical note

Copyright the Author(s) 2016. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

Keywords

  • Language Change
  • Temporal Text Classification
  • Support Vector Machines
  • Text Categorization
