Document Representation and Dimension Reduction for Text Clustering


Increasingly large text datasets and the high dimensionality associated with natural language pose a great challenge for text mining. In this research, a systematic study is conducted of the application of three Dimension Reduction Techniques (DRTs) to three document representation methods in the context of the text clustering problem, using several standard benchmark datasets. The dimension reduction techniques considered are Independent Component Analysis (ICA), Latent Semantic Indexing (LSI), and a technique based on Document Frequency (DF). These three techniques are applied to three document representation methods based on the Vector Space Model, namely the word, term, and N-gram representations. Experiments with the k-means clustering algorithm show that ICA and LSI clearly outperform DF on all datasets. For the word and N-gram representations, ICA gives better results than LSI. The experiments also show that the word representation yields better clustering results than the term and N-gram representations. Finally, for the N-gram representation, it is shown that a profile length of 2000 is enough to capture the information, and in most cases the 4-gram representation performs better than the 3-gram representation.
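As a rough illustration (not the authors' code), the character N-gram representation and the DF-based dimension reduction mentioned in the abstract can be sketched as follows. The function names and the toy corpus are invented for the example; DF-based reduction simply keeps the N-grams that occur in the most documents, truncated to a fixed profile length:

```python
from collections import Counter

def char_ngrams(text, n):
    """All overlapping character N-grams of a string (N-gram representation)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def df_profile(docs, n, profile_length):
    """Rank N-grams by Document Frequency (DF) -- the number of documents
    each N-gram appears in -- and keep only the top `profile_length`."""
    df = Counter()
    for doc in docs:
        # sorted() only to make tie-breaking among equal-DF N-grams deterministic
        df.update(sorted(set(char_ngrams(doc, n))))
    return [g for g, _ in df.most_common(profile_length)]

def to_vectors(docs, n, profile):
    """Frequency vectors over the reduced N-gram profile (Vector Space Model)."""
    index = {g: i for i, g in enumerate(profile)}
    vectors = []
    for doc in docs:
        v = [0] * len(profile)
        for g in char_ngrams(doc, n):
            if g in index:
                v[index[g]] += 1
        vectors.append(v)
    return vectors

# Toy usage: keep the two 3-grams with the highest document frequency,
# then represent each document as a frequency vector over that profile.
docs = ["the cat", "the dog", "the cow"]
profile = df_profile(docs, 3, 2)
vectors = to_vectors(docs, 3, docs and profile)
```

In the paper's setting, `profile_length` would be on the order of 2000 and the resulting vectors would then be fed to k-means (or further reduced with ICA or LSI); this sketch shows only the DF step, which serves as the baseline the abstract compares against.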

Authors: Evangelos E. Milios, M. Mahdi Shafiei, Singer Wang, Roger Zhang, Bin Tang, Jane Tougas, Raymond Spiteri

Download: DR_Proj_ver04