Feature hashing for language and dialect identification

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review



We evaluate feature hashing for language identification (LID), a method not previously used for this task. Using a standard dataset, we first show that while feature performance is high, LID data is highly dimensional and mostly sparse (>99.5%) as it includes large vocabularies for many languages; memory requirements grow as languages are added. Next we apply hashing using various hash sizes, demonstrating that there is no performance loss with dimensionality reductions of up to 86%. We also show that using an ensemble of low-dimension hash-based classifiers further boosts performance. Feature hashing is highly useful for LID and holds great promise for future work in this area.
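As a rough illustration of the hashing idea described in the abstract, the sketch below maps character n-gram counts into a vector of fixed length. It is not the paper's implementation: this record does not state the feature types, hash function, or classifier used, so the character trigrams, the MD5-based hash, and the names `char_ngrams`, `hash_features`, and `n_buckets` are all assumptions chosen for illustration.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text (assumed feature type)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hash_features(text, n_buckets=1024, n=3):
    """Count n-grams into a fixed-length vector using the hashing trick.

    No vocabulary is stored: each n-gram is hashed to a bucket index, so the
    vector length stays at n_buckets however many distinct n-grams occur.
    """
    vec = [0] * n_buckets
    for gram in char_ngrams(text, n):
        # MD5 is used only because it is deterministic across runs;
        # it is an illustrative choice, not the hash used in the paper.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vec[index] += 1  # distinct n-grams may collide in the same bucket
    return vec


if __name__ == "__main__":
    vec = hash_features("hello world", n_buckets=64)
    print(len(vec), sum(vec))
```

The vector length is set by `n_buckets` rather than by the size of the vocabulary, which is why memory use need not grow as languages are added. The cost is that unrelated n-grams can share a bucket, and collisions become more frequent as `n_buckets` shrinks.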

Original language: English
Title of host publication: ACL 2017
Subtitle of host publication: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers)
Editors: Regina Barzilay, Min-Yen Kan
Place of Publication: Stroudsburg, PA
Publisher: Association for Computational Linguistics (ACL)
Number of pages: 5
ISBN (Electronic): 9781945626760
Publication status: Published - 2017
Event: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017 - Vancouver, Canada
Duration: 30 Jul 2017 → 4 Aug 2017


Conference: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017

Bibliographical note

Copyright the Publisher. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

