Feature hashing for language and dialect identification

Research output: Chapter in Book/Report/Conference proceeding; Conference proceeding contribution; Research; peer-review

Abstract

We evaluate feature hashing for language identification (LID), a method not previously used for this task. Using a standard dataset, we first show that while feature performance is high, LID data is highly dimensional and mostly sparse (>99.5%) as it includes large vocabularies for many languages; memory requirements grow as languages are added. Next we apply hashing using various hash sizes, demonstrating that there is no performance loss with dimensionality reductions of up to 86%. We also show that using an ensemble of low-dimension hash-based classifiers further boosts performance. Feature hashing is highly useful for LID and holds great promise for future work in this area.
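The hashing idea summarised in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the character n-gram range, the CRC32 hash, the signed-update variant of the hashing trick, and the names `char_ngrams` and `hash_features` are all assumptions chosen for illustration. Each n-gram is mapped directly to one of a fixed number of buckets, so no vocabulary needs to be stored and the vector size stays constant as languages are added.

```python
import zlib
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Yield every character n-gram of length n_min..n_max (assumed feature type)."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def hash_features(text, n_buckets=2 ** 10):
    """Map a text to a sparse {bucket index: signed count} vector.

    The hash size n_buckets is a free parameter; 2**10 is an arbitrary
    illustrative value, not one taken from the paper.
    """
    vec = Counter()
    for gram in char_ngrams(text):
        h = zlib.crc32(gram.encode("utf-8"))   # deterministic 32-bit hash
        sign = 1 if h & 1 == 0 else -1          # lowest bit picks the sign
        index = (h >> 1) % n_buckets            # remaining bits pick the bucket
        vec[index] += sign
    # Drop buckets where colliding n-grams cancelled out exactly.
    return {i: v for i, v in vec.items() if v != 0}


vec = hash_features("feature hashing for language identification")
print(len(vec), "non-zero buckets out of", 2 ** 10)
```

Distinct n-grams can collide in the same bucket; the signed update is one common way to keep such collisions from systematically inflating counts. A smaller `n_buckets` saves memory at the cost of more collisions, which is the trade-off the abstract reports on.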

Language: English
Title of host publication: ACL 2017
Subtitle of host publication: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers)
Place of Publication: Stroudsburg, PA
Publisher: Association for Computational Linguistics (ACL)
Pages: 399-403
Number of pages: 5
Volume: 2
ISBN (Electronic): 9781945626760
DOI: https://doi.org/10.18653/v1/P17-2063
Publication status: Published - 2017
Event: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017 - Vancouver, Canada
Duration: 30 Jul 2017 - 4 Aug 2017

Conference

Conference: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017
Country: Canada
City: Vancouver
Period: 30/07/17 - 4/08/17

Fingerprint

dialect
Classifiers
Data storage equipment
language
performance
vocabulary

Cite this

Malmasi, S., & Dras, M. (2017). Feature hashing for language and dialect identification. In ACL 2017: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers) (Vol. 2, pp. 399-403). Stroudsburg, PA: Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/P17-2063
Malmasi, Shervin; Dras, Mark. / Feature hashing for language and dialect identification. ACL 2017: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers). Vol. 2. Stroudsburg, PA: Association for Computational Linguistics (ACL), 2017. pp. 399-403.
@inproceedings{f82d79176f9347daac5f442974f37a85,
title = "Feature hashing for language and dialect identification",
abstract = "We evaluate feature hashing for language identification (LID), a method not previously used for this task. Using a standard dataset, we first show that while feature performance is high, LID data is highly dimensional and mostly sparse (>99.5{\%}) as it includes large vocabularies for many languages; memory requirements grow as languages are added. Next we apply hashing using various hash sizes, demonstrating that there is no performance loss with dimensionality reductions of up to 86{\%}. We also show that using an ensemble of low-dimension hash-based classifiers further boosts performance. Feature hashing is highly useful for LID and holds great promise for future work in this area.",
author = "Shervin Malmasi and Mark Dras",
year = "2017",
doi = "10.18653/v1/P17-2063",
language = "English",
volume = "2",
pages = "399--403",
booktitle = "ACL 2017",
publisher = "Association for Computational Linguistics (ACL)",
}

Malmasi, S & Dras, M 2017, Feature hashing for language and dialect identification. in ACL 2017: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers). vol. 2, Association for Computational Linguistics (ACL), Stroudsburg, PA, pp. 399-403, 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, 30/07/17. https://doi.org/10.18653/v1/P17-2063

Feature hashing for language and dialect identification. / Malmasi, Shervin; Dras, Mark.

ACL 2017: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers). Vol. 2. Stroudsburg, PA: Association for Computational Linguistics (ACL), 2017. p. 399-403.


TY  - GEN
T1  - Feature hashing for language and dialect identification
AU  - Malmasi, Shervin
AU  - Dras, Mark
PY  - 2017
Y1  - 2017
AB  - We evaluate feature hashing for language identification (LID), a method not previously used for this task. Using a standard dataset, we first show that while feature performance is high, LID data is highly dimensional and mostly sparse (>99.5%) as it includes large vocabularies for many languages; memory requirements grow as languages are added. Next we apply hashing using various hash sizes, demonstrating that there is no performance loss with dimensionality reductions of up to 86%. We also show that using an ensemble of low-dimension hash-based classifiers further boosts performance. Feature hashing is highly useful for LID and holds great promise for future work in this area.
UR  - http://www.scopus.com/inward/record.url?scp=85040605553&partnerID=8YFLogxK
U2  - 10.18653/v1/P17-2063
DO  - 10.18653/v1/P17-2063
M3  - Conference proceeding contribution
VL  - 2
SP  - 399
EP  - 403
BT  - ACL 2017
PB  - Association for Computational Linguistics (ACL)
CY  - Stroudsburg, PA
ER  -

Malmasi S, Dras M. Feature hashing for language and dialect identification. In ACL 2017: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers). Vol. 2. Stroudsburg, PA: Association for Computational Linguistics (ACL). 2017. p. 399-403. https://doi.org/10.18653/v1/P17-2063