TY - GEN
T1 - Unsupervised feature selection for outlier detection by modelling hierarchical value-feature couplings
AU - Pang, Guansong
AU - Cao, Longbing
AU - Chen, Ling
AU - Liu, Huan
PY - 2016
Y1 - 2016
N2 - Proper feature selection for unsupervised outlier detection can improve detection performance but is very challenging due to complex feature interactions, the mixture of relevant features with noisy/redundant features in imbalanced data, and the unavailability of class labels. Little work has been done on this challenge. This paper proposes a novel Coupled Unsupervised Feature Selection framework (CUFS for short) to filter out noisy or redundant features for subsequent outlier detection in categorical data. CUFS quantifies the outlierness (or relevance) of features by learning and integrating both the feature value couplings and feature couplings. Such value-To-feature couplings capture intrinsic data characteristics and distinguish relevant features from those noisy/redundant features. CUFS is further instantiated into a parameter-free Dense Subgraph-based Feature Selection method, called DSFS. We prove that DSFS retains a 2-Approximation feature subset to the optimal subset. Extensive evaluation results on 15 real-world data sets show that DSFS obtains an average 48% feature reduction rate, and enables three different types of pattern-based outlier detection methods to achieve substantially better AUC improvements and/or perform orders of magnitude faster than on the original feature set. Compared to its feature selection contender, on average, all three DSFS-based detectors achieve more than 20% AUC improvement.
AB - Proper feature selection for unsupervised outlier detection can improve detection performance but is very challenging due to complex feature interactions, the mixture of relevant features with noisy/redundant features in imbalanced data, and the unavailability of class labels. Little work has been done on this challenge. This paper proposes a novel Coupled Unsupervised Feature Selection framework (CUFS for short) to filter out noisy or redundant features for subsequent outlier detection in categorical data. CUFS quantifies the outlierness (or relevance) of features by learning and integrating both the feature value couplings and feature couplings. Such value-To-feature couplings capture intrinsic data characteristics and distinguish relevant features from those noisy/redundant features. CUFS is further instantiated into a parameter-free Dense Subgraph-based Feature Selection method, called DSFS. We prove that DSFS retains a 2-Approximation feature subset to the optimal subset. Extensive evaluation results on 15 real-world data sets show that DSFS obtains an average 48% feature reduction rate, and enables three different types of pattern-based outlier detection methods to achieve substantially better AUC improvements and/or perform orders of magnitude faster than on the original feature set. Compared to its feature selection contender, on average, all three DSFS-based detectors achieve more than 20% AUC improvement.
UR - http://www.scopus.com/inward/record.url?scp=85014523542&partnerID=8YFLogxK
UR - http://purl.org/au-research/grants/arc/DP130102691
UR - http://purl.org/au-research/grants/arc/DP140100545
U2 - 10.1109/ICDM.2016.0052
DO - 10.1109/ICDM.2016.0052
M3 - Conference proceeding contribution
SN - 9781509054749
SP - 410
EP - 419
BT - 16th IEEE International Conference on Data Mining
A2 - Bonchi, Francesco
A2 - Domingo-Ferrer, Josep
A2 - Baeza-Yates, Ricardo
A2 - Zhou, Zhi-Hua
A2 - Wu, Xindong
PB - Institute of Electrical and Electronics Engineers (IEEE)
CY - Piscataway, NJ
T2 - 16th IEEE International Conference on Data Mining, ICDM 2016
Y2 - 12 December 2016 through 15 December 2016
ER -