Abstract—This paper presents a new feature selection method
for building data sets used to classify short Twitter messages.
A Twitter message is limited to 140 characters, so the number of
words per message is roughly the same for all messages. When all
messages are pooled, the pool can contain a large number of
distinct words, but any given message includes only a few of
them. The resulting feature vectors therefore form a sparse
matrix. Removing unrelated words from the feature space reduces
its dimension and, in turn, its sparseness. Unrelated words are
defined here as common words (high-frequency words) and noise
words (low-frequency words). Even after these words are removed,
some unrelated words may remain, so a feature selection technique
is needed to select the best feature set. The proposed method,
named the Ratio Method, is based on information theory. The value
it calculates increases when a word occurs frequently in a
particular group and decreases when the word occurs in all
groups. The best features can then be chosen by applying a
suitable threshold. Popular text classifiers such as SVM, Naïve
Bayes, and Decision Trees are used to evaluate the performance
of the new feature selection method and to compare it with
existing methods.
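The two-stage idea described above (prune common and noise words, then keep words concentrated in one group) can be illustrated with a minimal sketch. The abstract does not give the Ratio Method formula, so the score below is only a stand-in assumption: the share of a word's occurrences that fall in its most frequent group. The function names, the toy messages, and the parameters `min_df`, `max_df_ratio`, and `threshold` are all hypothetical.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=2, max_df_ratio=0.5):
    """Drop noise words (too few documents) and common words (too many)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per document
    n_docs = len(docs)
    return {w for w, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio}


def concentration_scores(docs, labels, vocab):
    """Illustrative score, NOT the paper's Ratio Method formula:
    fraction of a word's occurrences found in its most frequent group."""
    per_group = {}
    for tokens, label in zip(docs, labels):
        counts = per_group.setdefault(label, Counter())
        counts.update(t for t in tokens if t in vocab)
    scores = {}
    for w in vocab:
        group_counts = [c[w] for c in per_group.values()]
        total = sum(group_counts)
        scores[w] = max(group_counts) / total if total else 0.0
    return scores


def select_features(scores, threshold):
    """Keep words whose score reaches the (hypothetical) threshold."""
    return sorted(w for w, s in scores.items() if s >= threshold)


# Toy, made-up messages in two groups.
docs = [
    ["the", "team", "won", "the", "match"],
    ["the", "team", "lost", "the", "match"],
    ["the", "match", "was", "today"],
    ["the", "market", "fell", "today"],
    ["the", "market", "rose", "today"],
    ["the", "market", "closed"],
]
labels = ["sports", "sports", "sports", "business", "business", "business"]

vocab = prune_vocabulary(docs)
scores = concentration_scores(docs, labels, vocab)
selected = select_features(scores, threshold=0.8)
print(selected)
```

In this toy data, "the" is pruned as a common word and single-occurrence words are pruned as noise. The word "today" survives pruning but appears in both groups, so its score is lower than that of words confined to one group, and it falls below the threshold.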
Index Terms—Ratio method, information theory, term
frequency, inverse document frequency.
The authors are with the University of Colombo School of Computing,
Colombo 7, Sri Lanka (e-mail: inoshi@scorelab.org, kasun@ucsc.lk).
Cite: Inoshika Dilrukshi and Kasun de Zoysa, "A Feature Selection Method for Twitter News Classification," International Journal of Machine Learning and Computing, vol. 4, no. 4, pp. 365-370, 2014.