TY - GEN
T1 - Micro-clustering by data polishing
AU - Uno, Takeaki
AU - Maegawa, Hiroki
AU - Nakahara, Takanobu
AU - Hamuro, Yukinobu
AU - Yoshinaka, Ryo
AU - Tatsuta, Makoto
N1 - Funding Information:
Acknowledgments: This work was supported by JST CREST Grant Number JPMJCR1401, Japan.
Publisher Copyright:
© 2017 IEEE.
PY - 2017/7/1
Y1 - 2017/7/1
AB - We address the problem of unsupervised soft clustering that we call micro-clustering. The aim is to enumerate all groups composed of records strongly related to each other, whereas standard clustering methods find boundaries at which records are few. Existing methods have several weak points: generation of intractable numbers of clusters, biased size distributions, lack of robustness, etc. We propose a new methodology, data polishing. Data polishing clarifies the cluster structures in the data by perturbing the data according to feasible hypotheses. More precisely, for graph clustering problems, data polishing replaces dense subgraphs that would correspond to clusters with cliques, and deletes edges not included in any dense subgraph. The clusters are clarified as maximal cliques and are thus easy to find, and the number of maximal cliques is reduced to a tractable number. We also propose an efficient algorithm so that the computation is done in a few minutes even for large-scale data. The computational experiments demonstrate the efficiency of our formulation and algorithm, i.e., the number of solutions is small, such as 1,000, the members of each group are deeply related, and the computation time is short.
KW - algorithm
KW - clustering
KW - data cleaning
KW - pattern mining
UR - http://www.scopus.com/inward/record.url?scp=85047735507&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85047735507&partnerID=8YFLogxK
U2 - 10.1109/BigData.2017.8258024
DO - 10.1109/BigData.2017.8258024
M3 - Conference contribution
AN - SCOPUS:85047735507
T3 - Proceedings - 2017 IEEE International Conference on Big Data, Big Data 2017
SP - 1012
EP - 1018
BT - Proceedings - 2017 IEEE International Conference on Big Data, Big Data 2017
A2 - Nie, Jian-Yun
A2 - Obradovic, Zoran
A2 - Suzumura, Toyotaro
A2 - Ghosh, Rumi
A2 - Nambiar, Raghunath
A2 - Wang, Chonggang
A2 - Zang, Hui
A2 - Baeza-Yates, Ricardo
A2 - Hu, Xiaohua
A2 - Kepner, Jeremy
A2 - Cuzzocrea, Alfredo
A2 - Tang, Jian
A2 - Toyoda, Masashi
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th IEEE International Conference on Big Data, Big Data 2017
Y2 - 11 December 2017 through 14 December 2017
ER -