TY - GEN
T1 - A spectral clustering approach to optimally combining numerical vectors with a modular network
AU - Shiga, Motoki
AU - Takigawa, Ichigaku
AU - Mamitsuka, Hiroshi
PY - 2007
Y1 - 2007
N2 - We address the issue of clustering numerical vectors with a network. The problem setting is basically equivalent to constrained clustering by Wagstaff and Cardie and semi-supervised clustering by Basu et al., but our focus is more on the optimal combination of two heterogeneous data sources. An application of this setting is web pages which can be numerically vectorized by their contents, e.g. term frequencies, and which are hyperlinked to each other, showing a network. Another typical application is genes whose behavior can be numerically measured and a gene network can be given from another data source. We first define a new graph clustering measure which we call normalized network modularity, by balancing the cluster size of the original modularity. We then propose a new clustering method which integrates the cost of clustering numerical vectors with the cost of maximizing the normalized network modularity into a spectral relaxation problem. Our learning algorithm is based on spectral clustering which makes our issue an eigenvalue problem and uses k-means for final cluster assignments. A significant advantage of our method is that we can optimize the weight parameter for balancing the two costs from the given data by choosing the minimum total cost. We evaluated the performance of our proposed method using a variety of datasets including synthetic data as well as real-world data from molecular biology. Experimental results showed that our method is effective enough to have good results for clustering by numerical vectors and a network.
AB - We address the issue of clustering numerical vectors with a network. The problem setting is basically equivalent to constrained clustering by Wagstaff and Cardie and semi-supervised clustering by Basu et al., but our focus is more on the optimal combination of two heterogeneous data sources. An application of this setting is web pages which can be numerically vectorized by their contents, e.g. term frequencies, and which are hyperlinked to each other, showing a network. Another typical application is genes whose behavior can be numerically measured and a gene network can be given from another data source. We first define a new graph clustering measure which we call normalized network modularity, by balancing the cluster size of the original modularity. We then propose a new clustering method which integrates the cost of clustering numerical vectors with the cost of maximizing the normalized network modularity into a spectral relaxation problem. Our learning algorithm is based on spectral clustering which makes our issue an eigenvalue problem and uses k-means for final cluster assignments. A significant advantage of our method is that we can optimize the weight parameter for balancing the two costs from the given data by choosing the minimum total cost. We evaluated the performance of our proposed method using a variety of datasets including synthetic data as well as real-world data from molecular biology. Experimental results showed that our method is effective enough to have good results for clustering by numerical vectors and a network.
KW - Eigenvalue problem
KW - Heterogeneous data sources
KW - K-means
KW - Network modularity
KW - Spectral clustering
UR - http://www.scopus.com/inward/record.url?scp=36849029834&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=36849029834&partnerID=8YFLogxK
U2 - 10.1145/1281192.1281262
DO - 10.1145/1281192.1281262
M3 - Conference contribution
AN - SCOPUS:36849029834
SN - 1595936092
SN - 9781595936097
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 647
EP - 656
BT - KDD-2007
T2 - KDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Y2 - 12 August 2007 through 15 August 2007
ER -