Studying the structure of given names and how they associate with gender and ethnicity is an interesting research topic that has recently found practical uses in various areas. Given the paucity of annotated name data, we develop and make available a new dataset containing 14k given names. Using this dataset, we take a data-driven approach to this task and achieve up to 90% accuracy for classifying the gender of unseen names. For ethnicity identification, our system achieves 83% accuracy. We also experiment with a feature analysis method for exploring the most informative features for this task.
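As an illustration of what a data-driven approach to classifying unseen names can look like, the following is a minimal sketch and not the authors' system. The abstract does not say which features or classifier were used, so the character n-gram features, the Naive Bayes model, the helper names (`char_ngrams`, `NaiveBayesNameClassifier`) and the tiny hand-made training list are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(name, n_max=3):
    """Return character n-grams (lengths 1..n_max); ^ and $ mark name boundaries."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NaiveBayesNameClassifier:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, names, labels):
        self.label_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for name, label in zip(names, labels):
            grams = char_ngrams(name)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, name):
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + vocab_size
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(count / total)
            for gram in char_ngrams(name):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-made training data (illustrative only, NOT the paper's dataset).
train_names = ["anna", "maria", "sophia", "olivia", "emma",
               "john", "robert", "david", "mark", "peter"]
train_labels = ["F"] * 5 + ["M"] * 5

clf = NaiveBayesNameClassifier().fit(train_names, train_labels)
print(clf.predict("julia"))    # a name not seen during training
print(clf.predict("richard"))  # a name not seen during training
```

Boundary markers let the model pick up on name endings and beginnings, which is one plausible source of signal for this kind of task. With only ten training names the predictions carry no weight; the same interface could be trained on a larger annotated list.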
|Number of pages||5|
|Journal||Proceedings of Australasian Language Technology Association Workshop 2014 : ALTA 2014|
|Publication status||Published - 2014|
|Event||Australasian Language Technology Association Workshop (12th : 2014) - Melbourne, Australia|
|Duration||26 Nov 2014 → 28 Nov 2014|