The Exciting Exploration of Unsupervised Deep Embedding for Clustering Analysis

By
Aoife McCardle
Digital Marketing Executive

As a data scientist joining Novafutur, you have the opportunity to work on projects that incorporate new and exciting research.


For example, inspired by Xie, Girshick and Farhadi’s paper “Unsupervised Deep Embedding for Clustering Analysis” (2016), our data science team has been exploring how reducing dimensionality can make clustering our data far more efficient.


A brief explanation of Data Clustering methods so far:


The most popular methods for clustering data are K-means (MacQueen, 1967) and Gaussian Mixture Models (GMM) (Bishop, 2006). They are fast and applicable to a wide range of problems; however, their distance metrics are limited to the original data space, and they tend to be ineffective when the input dimensionality is high. When you are trying to sort and cluster very large datasets described by many features, dimensionality is inevitably very high, and that is a problem: as the number of dimensions grows, distances between points become less and less informative, the so-called “curse of dimensionality” (Bellman, 1961).
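

As a quick illustration (our own sketch using scikit-learn, not code from the paper or from our production stack), both of these models measure distance or likelihood directly in the raw input space:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 784))  # e.g. 2,000 flattened 28x28 images: 784 raw features

# Both models work directly in the 784-dimensional input space,
# which is exactly where the curse of dimensionality bites.
km_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=10, covariance_type="diag", random_state=0).fit_predict(X)

DEC, by contrast, learns a lower-dimensional feature space first and clusters there, which is the difference the rest of this post is about.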


In comparison, spectral clustering algorithms have gained popularity recently as they allow for more flexible distance metrics and generally perform better than K-means. Yang et al. (2010) and Nie et al. (2011) have explored combining spectral clustering with embedding, and Tian et al. (2014) proposed a spectral-clustering-based algorithm that replaces eigenvalue decomposition with a deep autoencoder; though this improves performance, it further increases memory consumption.


Also, because most spectral clustering algorithms need to compute the full graph Laplacian matrix, their complexity is quadratic or super-quadratic in the number of data points, which ultimately requires specialised machines with large memory for any dataset larger than around 10,000 points. Scaling spectral clustering up to larger datasets therefore means trading off performance for speed (Yan et al., 2009), which is far from ideal when high performance is essential in data science.
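

A quick back-of-the-envelope calculation (our own illustration, not taken from any of the cited papers) shows why a dense pairwise matrix stops being practical:

# Memory needed to hold a dense n x n affinity/Laplacian matrix of float64 values.
for n in (10_000, 100_000, 1_000_000):
    gib = n * n * 8 / 2**30
    print(f"n = {n:>9,}: {gib:>10,.1f} GiB")
# n =    10,000:        0.7 GiB
# n =   100,000:       74.5 GiB
# n = 1,000,000:    7,450.6 GiB

Sparse approximations help, but the underlying growth is still quadratic, whereas DEC only ever needs to store the embedded points and the k cluster centers.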


This is where Deep Embedded Clustering (DEC) comes in handy. Not only does it reduce the dimensionality of the dataset, but because its complexity is linear in the number of data points, it also scales smoothly to large datasets!


Why do we care about DEC?


Our experimentation with incorporating unsupervised DEC into our data systems is just one example of how working within the Nova data science team will give you the exciting opportunity to actively implement new data science research. This is just the tip of the iceberg in terms of ground-breaking research and experimentation; think about what you would be able to bring to the table.


So, how does Deep Embedded Clustering work?


The algorithm proposed by Xie, Girshick and Farhadi (2016) clusters data by simultaneously learning a set of k cluster centers in a feature space and the parameters of the deep neural network (DNN) that maps data points into that feature space. DNNs are well suited to this because of their theoretical function-approximation properties and the feature-learning capabilities they have demonstrated in practice.
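

To make that concrete, here is a minimal NumPy sketch (our own illustration, not the authors’ released code) of the soft assignment at the heart of DEC: each embedded point z_i is compared with each of the k cluster centers mu_j using a Student’s t-distribution kernel, giving a soft cluster membership q_ij.

import numpy as np

def soft_assignments(z, centers, alpha=1.0):
    # z: (n, d) points embedded in the feature space by the DNN encoder
    # centers: (k, d) learnable cluster centers in that feature space
    # Returns q: (n, k) soft assignments from a Student's t kernel,
    # following Xie, Girshick and Farhadi (2016).
    dist_sq = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)  # each row sums to 1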


As the researchers put it:


“DEC has two phases: (1) parameter initialization with a deep autoencoder (Vincent et al., 2010) and (2) parameter optimization (i.e., clustering), where we iterate between computing an auxiliary target distribution and minimizing the Kullback–Leibler (KL) divergence to it.”
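

Concretely, phase (2) alternates between two small computations, sketched below in NumPy as a continuation of the snippet above (our own illustration of the published formulas): build the auxiliary target distribution P by sharpening the current soft assignments Q, then nudge the encoder weights and cluster centers to reduce KL(P || Q).

import numpy as np

def target_distribution(q):
    # P squares the soft assignments and normalises by the cluster
    # frequencies f_j = sum_i q_ij, so confident assignments are reinforced.
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # The clustering loss minimised in phase (2): KL(P || Q).
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

In the full algorithm the gradient of this loss is backpropagated through the encoder, Q is recomputed as the embedding improves, and training stops once the cluster assignments stop changing between iterations.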


The bonus of this system is that it requires little supervision and intervention: once it is up and running, it can be trusted to cluster data quickly and accurately while learning from itself!



If you are interested in applying for a Data Science position at Nova please take a look at our careers page or click on the link provided here.


References:


Bellman, R. E. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press.


Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.


MacQueen, J. (1967, June). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability (Vol. 1, No. 14, pp. 281-297).


Nie, F., Zeng, Z., Tsang, I. W., Xu, D., & Zhang, C. (2011). Spectral embedded clustering: A framework for in-sample and out-of-sample spectral clustering. IEEE Transactions on Neural Networks, 22(11), 1796-1808.


Tian, F., Gao, B., Cui, Q., Chen, E., & Liu, T. Y. (2014, June). Learning deep representations for graph clustering. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 28, No. 1).
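

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11, 3371-3408.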


Xie, J., Girshick, R., & Farhadi, A. (2016, June). Unsupervised deep embedding for clustering analysis. In International conference on machine learning (pp. 478-487). PMLR.


Yan, D., Huang, L., & Jordan, M. I. (2009, June). Fast approximate spectral clustering. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 907-916).


Yang, Y., Xu, D., Nie, F., Yan, S., & Zhuang, Y. (2010). Image clustering using local discriminant models and global integration. IEEE Transactions on Image Processing, 19(10), 2761-2773.


Written by:
Aoife McCardle
Contact email:
marketing@novafutur.com