We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using ensemble and oracle combination, and provide learning curves to help us understand which languages are more challenging. A number of difficult sentences are identified and investigated further with human annotation.
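To make the upper-bound estimate concrete, the sketch below contrasts a plurality-vote ensemble with an oracle combination, in which a sentence counts as correct if at least one system labelled it correctly. It is a minimal illustration, not the procedure used in the paper: the function names, the tie-breaking rule, and the toy predictions are all assumptions made for this example.

```python
from collections import Counter


def plurality_vote(system_predictions):
    """Combine per-system label lists (all the same length) by plurality vote.

    Ties go to the label seen first, i.e. the earliest system in the list
    (an assumption of this sketch, not something the paper specifies).
    """
    combined = []
    for labels in zip(*system_predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


def accuracy(predicted, gold):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def oracle_accuracy(system_predictions, gold):
    """Fraction of items that at least one system labelled correctly."""
    hits = sum(
        any(label == g for label in labels)
        for labels, g in zip(zip(*system_predictions), gold)
    )
    return hits / len(gold)


# Toy data, invented purely for illustration (not results from the shared task).
gold = ["bs", "hr", "sr", "pt-BR", "pt-PT"]
systems = [
    ["bs", "hr", "hr", "pt-PT", "pt-BR"],  # hypothetical system A
    ["hr", "hr", "sr", "pt-PT", "pt-BR"],  # hypothetical system B
    ["bs", "sr", "sr", "pt-BR", "pt-BR"],  # hypothetical system C
]

vote_acc = accuracy(plurality_vote(systems), gold)
oracle_acc = oracle_accuracy(systems, gold)
```

On this toy data the oracle score is higher than the voting score, because the fourth sentence is labelled correctly by only one system and is outvoted. The fifth sentence is missed by every system, so even the oracle cannot recover it; sentences of that kind are the natural candidates for the closer inspection the abstract describes.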
|Title of host publication||Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016|
|Editors||Nicoletta Calzolari, Khalid Choukri, Helene Mazo, Asuncion Moreno, Thierry Declerck, Sara Goggi, Marko Grobelnik, Jan Odijk, Stelios Piperidis, Bente Maegaard, Joseph Mariani|
|Publisher||European Language Resources Association (ELRA)|
|Number of pages||8|
|Publication status||Published - 1 Jan 2016|
|Event||10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia (Duration: 23 May 2016 → 28 May 2016)|
|Conference||10th International Conference on Language Resources and Evaluation, LREC 2016|
|Period||23/05/16 → 28/05/16|
Bibliographical note: Copyright the European Language Resources Association. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
- Language identification
- Language varieties