TY - JOUR
T1 - Improving topic models with Latent Feature Word Representations
AU - Nguyen, Dat Quoc
AU - Billingsley, Richard
AU - Du, Lan
AU - Johnson, Mark
N1 - Copyright the Publisher 2015. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
PY - 2015
Y1 - 2015
N2 - Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.
M3 - Article
SN - 2307-387X
VL - 3
SP - 299
EP - 313
JO - Transactions of the Association for Computational Linguistics
JF - Transactions of the Association for Computational Linguistics
ER -