TY - GEN
T1 - Learning heterogeneous coupling relationships between non-iid terms
AU - Li, Mu
AU - Li, Jinjiu
AU - Ou, Yuming
AU - Zhang, Ya
AU - Luo, Dan
AU - Bahtia, Maninder
AU - Cao, Longbing
PY - 2014
Y1 - 2014
N2 - With the rapid proliferation of social media and online communities, a vast amount of text data has been generated. Extracting insight from this text data has become increasingly important, and a variety of text mining and processing algorithms have been created in recent years, such as classification, clustering, and similarity comparison. Most previous research uses a vector-space model for text representation and analysis. However, the vector-space model does not utilise information about the relationships between terms. Moreover, classic classification methods also ignore the relationships between text documents. In other words, traditional text mining techniques assume that terms, and likewise documents, are independent and identically distributed (iid). In this paper, we introduce a novel term representation that incorporates the coupled relations between terms. This coupled representation provides much richer information, which enables us to create a coupled similarity metric for measuring document similarity, and a K-nearest centroid classifier based on coupled document similarity is applied to the classification task. Experiments verify that the proposed approach outperforms the classic vector-space based classifier and show its potential advantages and richness for exploring other text mining tasks.
AB - With the rapid proliferation of social media and online communities, a vast amount of text data has been generated. Extracting insight from this text data has become increasingly important, and a variety of text mining and processing algorithms have been created in recent years, such as classification, clustering, and similarity comparison. Most previous research uses a vector-space model for text representation and analysis. However, the vector-space model does not utilise information about the relationships between terms. Moreover, classic classification methods also ignore the relationships between text documents. In other words, traditional text mining techniques assume that terms, and likewise documents, are independent and identically distributed (iid). In this paper, we introduce a novel term representation that incorporates the coupled relations between terms. This coupled representation provides much richer information, which enables us to create a coupled similarity metric for measuring document similarity, and a K-nearest centroid classifier based on coupled document similarity is applied to the classification task. Experiments verify that the proposed approach outperforms the classic vector-space based classifier and show its potential advantages and richness for exploring other text mining tasks.
KW - Non-iid
KW - Coupled similarity
KW - Vector representation
UR - http://www.scopus.com/inward/record.url?scp=84901677348&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-55192-5_7
DO - 10.1007/978-3-642-55192-5_7
M3 - Conference proceeding contribution
SN - 9783642551918
T3 - Lecture Notes in Computer Science
SP - 79
EP - 91
BT - Agents and Data Mining Interaction
A2 - Cao, Longbing
A2 - Zeng, Yifeng
A2 - Symeonidis, Andreas L.
A2 - Gorodetsky, Vladimir
A2 - Müller, Jörg P.
A2 - Yu, Philip S.
PB - Springer, Springer Nature
CY - Berlin
T2 - 9th International Workshop on Agents and Data Mining Interaction, ADMI 2013
Y2 - 6 May 2013 through 7 May 2013
ER -