Short Text Classification With a Convolutional Neural Network-Based Method
Yibo Hu, Yang Li, Tao Yang, Quan Pan
Abstract— Traditional machine learning algorithms are easily affected by the dataset in short text classification tasks, so they generalize poorly when confronted with new situations. This paper presents a new method, SVMCNN, which combines Convolutional Neural Networks and Support Vector Machines. The SVMCNN model is trained with labeled datasets and tested on collected Twitter data. The results show that SVMCNN, and especially pre-trained SVMCNN, performs well in short text classification, achieving high Precision, Recall, and F1-measure.
I. INTRODUCTION
According to the latest statistics, the number of global Internet users has exceeded 4 billion. In 2017, 250 million new Internet users came online, meaning the Internet penetration rate had exceeded 50%, and the number of social media users had grown by 13%. Numerous social applications attract many Internet users; for example, the number of Twitter's monthly active users had reached 330 million as of 2017¹. Internet users generate
a large amount of information on these social applications
every day, including articles, news, and comments. This information is mainly distributed as text, and much of it consists of short texts. Through this information, it is possible to filter spam, advertisements, and illegal content. Meanwhile, by analyzing it, significant events such as natural disasters and large-scale public events may be extracted. Text classification is a useful technology for these scenarios.
Traditional text classification methods are mainly based
on statistical principles, using manually labeled datasets to
train classifiers, which then classify new data. Nii et al. use a KNN (K-Nearest Neighbor) based system to classify Japanese nursing-care texts and select the candidate category for each text[1]. Naive Bayes (NB) is often used as a baseline in text classification because it is fast and easy to implement. Rennie et al. show that, with proper preprocessing, NB can be competitive with more advanced methods such as support vector machines[2]. Diab and Hindi use three
This work was not supported by any organization.
Y. Hu is with the School of Automation, Northwestern Polytechnical University, Xi'an 710072, China (hyb@mail.nwpu.edu.cn).
Y. Li is with the School of Automation, Northwestern Polytechnical University, Xi'an 710072, China (liyangnpu@mail.nwpu.edu.cn).
T. Yang is with the School of Automation, Northwestern Polytechnical University, Xi'an 710072, China (corresponding author; phone: 13571913583; yangtao107@nwpu.edu.cn).
Q. Pan is with the School of Automation, Northwestern Polytechnical University, Xi'an 710072, China (quanpan@nwpu.edu.cn).
¹https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/
ways to improve the performance of NB when dealing with sparse text data[3]. Xu proposes three Bayesian counterparts and shows that a Bayesian NB classifier with a Gaussian event model is clearly better than the classical counterpart for text classification[4]. Lilleberg et al. use SVM (Support Vector Machine) to verify that combining TF-IDF and word2vec can outperform TF-IDF alone on text classification, but this study does not consider the impact of redundant features on SVM[5]. Xia et al. use SVM to perform Chinese sentiment analysis on online hotel reviews[6]; the effects of different stop-word filtering methods and feature selection methods are verified with SVM, but the text representation loses semantic information. Though NB and SVM are commonly used for text classification, their performance depends heavily on the variant, features, and datasets used. For short snippet sentiment tasks, NB actually does better than SVM, while the opposite holds for longer documents. Wang et al. propose the NBSVM model, which uses NB log-count ratios as feature values and SVM for classification[7]. But NB cannot guarantee that it will provide the most representative features.
In recent years, the use of neural networks to build language models has gradually matured. Bengio et al. propose a neural probabilistic language model[8]. Hinton proposes the concept of distributed representations, i.e., word embeddings, which are valued by more and more researchers[9]. Word embeddings not only avoid the "curse of dimensionality" but also describe the relationships between words at a higher semantic level.
With the emergence and development of deep learning, many fields have adopted it, including text classification[10][11]. Kim uses a simple convolutional neural network (CNN) model for the semantic classification of sentences[12]; the experimental results show that the CNN model performs no worse than traditional methods on this task. Kalchbrenner et al. propose a multi-layered CNN with k-max pooling for sentence classification[13]. Zhang et al. use a CNN for text feature extraction[14]. Conneau et al. propose a very deep CNN model for text classification, but it takes a long time to train[15]. Lai et al. propose the RCNN model, which combines CNN and RNN[16]: the RNN captures context information, and the CNN constructs a semantic representation of the text. Lee and Dernoncourt use RNN and CNN to classify sequential short texts and show that the CNN works better[17]. At the same time, the short text representation outperforms the class representation, and the effect degrades when they are used simultaneously, because the short text representation contains richer information than the class representation.
This paper focuses on user comments on the Twitter social platform and discusses a short text classification method that combines CNN with SVM, using the CNN for feature extraction and the SVM for classification.
II. SHORT TEXT CLASSIFICATION
Short texts are unstructured data and need to be converted into structured data that a computer can process directly. Structured representations contain a large amount of semantically relevant information, that is, a large number of features, many of which are of little use for classification. The most important features are therefore extracted and used to train the classifiers.
At present, commonly used text classification methods include traditional methods and deep learning based methods. Traditional text classification methods are mainly based on machine learning and use statistical principles to classify. Deep learning based methods mainly use neural networks to extract text features, combining low-level features into more abstract high-level features. This paper combines these two approaches and proposes a Support Vector Machine with Convolutional Neural Network (SVMCNN).
III. SVMCNN MOD EL
This paper combines CNN with SVM because CNN can capture features between consecutive words through convolution processing, while SVM can obtain the optimal solution from the available information even with limited samples. Fig. 1 shows the workflow of the SVMCNN model, which uses the CNN to extract features of short texts and then uses an SVM classifier for classification.
There are many short texts of various lengths in the Twitter datasets. Each short text is initialized as a sequence of vectors $(V_1, V_2, \ldots, V_N)$, where $V_i \in \mathbb{R}^k$ is the $k$-dimensional word vector corresponding to the $i$-th word in the short text, and $N$ is the number of words in the longest text of the Twitter datasets. Short texts with a length less than $N$ are padded so that all short texts have the same length. The word vectors of each short text are concatenated as

$$S = V_1 \oplus V_2 \oplus \cdots \oplus V_N \quad (1)$$

where $S$ is the representation of the short text and $\oplus$ is the concatenation operator. The word vectors are concatenated into a matrix, which is then input into the CNN model for feature extraction.
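As an illustration, the padding and concatenation of Eq. (1) can be sketched in a few lines of Python; the toy vocabulary, dimensions, and random vectors below are stand-ins for this sketch, not the paper's actual data.

```python
import numpy as np

k = 4          # word vector dimension (the paper uses 128)
N = 6          # number of words in the longest text
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=k) for w in "so sad to hear of the attack".split()}

def text_to_matrix(tokens, N, k):
    """Pad to length N and stack word vectors into an N x k matrix, as in Eq. (1)."""
    rows = [embedding.get(t, np.zeros(k)) for t in tokens[:N]]
    rows += [np.zeros(k)] * (N - len(rows))   # zero-pad texts shorter than N
    return np.stack(rows)                     # S = V1 (+) V2 (+) ... (+) VN

S = text_to_matrix("so sad to hear".split(), N, k)
print(S.shape)                                # (6, 4)
```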
A convolution operation involves a filter $W \in \mathbb{R}^{h \times k}$, where $h$ is the height of the convolution kernel window; a window of $h$ word vectors is mapped to produce a new feature. Let $V_{i:i+h-1}$ denote the concatenation of $h$ word vectors $(V_i, V_{i+1}, \ldots, V_{i+h-1})$; their feature is generated by

$$c_i = f(W \cdot V_{i:i+h-1} + b) \quad (2)$$

where $b \in \mathbb{R}$ is a bias term and $f$ is a non-linear activation function such as Sigmoid, Tanh, or ReLU. This paper uses ReLU as the activation function. ReLU is a piecewise linear function that can reduce the interdependence of parameters and can therefore ease the overfitting problem.
The filter is applied to each possible window of word vectors $(V_{1:h}, V_{2:h+1}, \ldots, V_{N-h+1:N})$ to produce a feature map $[c_1, c_2, \ldots, c_{N-h+1}]$. Different filters extract text features from different perspectives, so different feature maps can be obtained by setting the filter sizes and the number of filters of each size.

After the feature maps are produced, a max-pooling layer reduces the number of parameters and retains the optimal features. All the local optimal features are then connected through a fully connected layer whose output is the feature vector of the short text.
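A minimal numpy sketch of Eq. (2) and the max-pooling step may make this concrete; the filter values, bias, and dimensions below are illustrative assumptions, not the trained parameters.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def feature_map(S, W, b):
    """Slide an h x k filter over S: c_i = f(W . V_{i:i+h-1} + b), as in Eq. (2)."""
    N, _ = S.shape
    h = W.shape[0]
    return np.array([relu(np.sum(W * S[i:i + h]) + b) for i in range(N - h + 1)])

rng = np.random.default_rng(1)
S = rng.normal(size=(6, 4))          # stand-in padded short text from Eq. (1)
pooled = []
for h in (3, 4, 5):                  # the paper's filter sizes
    W, b = rng.normal(size=(h, 4)), 0.1
    c = feature_map(S, W, b)         # feature map [c_1, ..., c_{N-h+1}]
    pooled.append(c.max())           # max-pooling keeps the strongest feature
text_vector = np.array(pooled)       # concatenated local optima, one per filter
print(text_vector.shape)             # (3,)
```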
Finally, the SVM classifier is used to classify the short text features. Let $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(m)})^T$ be the $i$-th $m$-dimensional feature vector, and $y_i \in \{-1, 1\}$ the category of the $i$-th short text. The SVM classifier separates the feature vectors of the short texts by learning to find a hyperplane

$$\vec{\omega} \cdot x + b = 0 \quad (3)$$

where $\vec{\omega}$ is the normal vector that determines the direction of the hyperplane, and $b$ is the displacement term that determines the distance between the hyperplane and the origin. The distance from $x_i$ to the hyperplane is

$$r_i = \frac{|\vec{\omega} \cdot x_i + b|}{\|\vec{\omega}\|} \quad (4)$$

Finding the optimal hyperplane means finding the two nearest vectors of different classes whose distances to the hyperplane are equal and whose summed distance to the hyperplane is largest. This is equivalent to

$$\min \frac{1}{2}\|\vec{\omega}\|^2 \quad \text{s.t.} \quad y_i(\omega^T x_i + b) \geq 1, \; i = 1, \ldots, n \quad (5)$$

After $\vec{\omega}$ and $b$ of the optimal hyperplane are obtained, this hyperplane is used to classify the short text feature vectors.
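The classification stage can be sketched with scikit-learn, whose LinearSVC solves a soft-margin form of Eq. (5); the 384-dimensional features and labels below are random stand-ins for the CNN outputs, not the paper's data.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 384))     # stand-in 384-dim CNN feature vectors
y = rng.choice([-1, 1], size=200)   # sentiment labels y_i in {-1, 1}

clf = LinearSVC(C=1.0)              # learns omega and b of the hyperplane, Eq. (3)
clf.fit(X, y)
omega, b = clf.coef_[0], clf.intercept_[0]
r = np.abs(X @ omega + b) / np.linalg.norm(omega)  # distances to the hyperplane, Eq. (4)
print(clf.predict(X[:5]), r[:5])
```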
IV. EXPERIMENT DESIGN AND RESULTS ANALYSIS
A. Datasets and Evaluation
This paper uses two datasets. One is the sentiment polarity datasets². Their role is to train the SVMCNN model parameters. The datasets include a positive subset and a negative subset, each containing more than 5,000 movie reviews. This paper trains the model with 10-fold cross-validation. The other is the Twitter datasets, whose role is to evaluate the generalization ability of the SVMCNN model. The Twitter datasets include a total of 3,169 comments collected from the Twitter social platform, and every piece of data carries the user's sentiment. For example, "thankfully, overall, in the long run, things are getting better in the world" is obviously positive, while "so sad to hear of the terrorist attack in Egypt" expresses a negative sentiment.
²http://www.cs.cornell.edu/people/pabo/movie-review-data/
Fig. 1. SVMCNN model
There are many illegal characters in the Twitter data that would affect short text classification, so the Twitter data are pre-processed to remove the unnecessary characters. This paper trains the model on the movie-review data and evaluates it on the Twitter data, in order to assess the model's ability to adapt to very different data.
This paper evaluates the performance of the algorithms with three indicators: Precision, Recall, and F1-measure.
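For concreteness, the three indicators can be computed with scikit-learn; the label arrays below are illustrative stand-ins, not the paper's actual predictions.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, -1, 1, -1, -1, 1]   # manually determined categories
y_pred = [1, -1, -1, 1, -1, 1, 1]   # a model's predictions

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
```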
B. Experiment Design
It is essential to use the CNN to extract feature vectors of short texts. Short texts must be initialized as word vectors when they are input into the CNN model, and different initialization methods yield different classification results. This paper uses random initialization and pre-trained initialization for text representation, respectively.
Random initialization simply inputs the datasets into the CNN model, whose input layer performs the text quantization. With pre-trained initialization, the datasets are mapped to word vectors based on a pre-trained model; this paper uses a public word embedding model pre-trained with word2vec³. Word2vec obtains a word vector space by performing unsupervised learning on a large text corpus. As long as a large corpus covering most everyday language is collected, a universal word embedding model can be pre-trained. Using this word embedding model makes the initialized word vectors carry more semantic information.
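As an illustration, such a pre-trained model can be loaded with gensim; the file path and vocabulary below are assumptions for this sketch, and out-of-vocabulary words fall back to zero vectors.

```python
import numpy as np
from gensim.models import KeyedVectors

# Load the public pre-trained model (footnote 3); the local path is illustrative.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

vocab = ["sad", "hear", "terrorist", "attack"]   # stand-in vocabulary
dim = w2v.vector_size                            # 300 for this model
embedding_matrix = np.stack(
    [w2v[w] if w in w2v else np.zeros(dim) for w in vocab])
print(embedding_matrix.shape)                    # (4, 300)
```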
This paper builds a CNN model with three convolutional layers based on TensorFlow⁴. The specific parameters of the CNN model are given in Table I. After feature extraction by the CNN, each short text is represented as a 384-dimensional vector. These vectors capture the main features of each short text and are input into the SVM classifier for classification.
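A hedged sketch of such a feature extractor, built in Keras from Table I's hyperparameters, is given below; the exact layer arrangement and the maximum text length N are assumptions, since the paper lists only the hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers

N, k = 56, 128                    # max text length (illustrative) and Table I's vector dim
inp = layers.Input(shape=(N, k))
pooled = []
for h in (3, 4, 5):               # filter sizes from Table I
    c = layers.Conv1D(128, h, activation="relu")(inp)  # 128 filters per size
    pooled.append(layers.GlobalMaxPooling1D()(c))      # max-pooling over each feature map
features = layers.Concatenate()(pooled)                # 3 x 128 = 384-dim feature vector
features = layers.Dropout(0.5)(features)               # dropout rate from Table I
extractor = tf.keras.Model(inp, features)
print(extractor.output_shape)                          # (None, 384)
```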
C. Results and Analysis
This paper compares random initialization and pre-trained initialization for short text feature extraction with the CNN model. The CNN model is trained on the sentiment polarity datasets.
³GoogleNews-vectors-negative300.bin
⁴https://www.tensorflow.org/
TABLE I
CNN MODEL PARAMETERS

Parameter                Value
Word vector dimension    128
Filter sizes             3, 4, 5
Filter number            128
Dropout rate             0.5
Batch size               64
Training steps           3000
Learning rate            10⁻³
With random initialization, the model starts to converge after 2000 steps, and the accuracy on the training set is 95% when training completes. With pre-trained initialization, training already begins to converge at 1000 steps, and the accuracy on the training set is close to 100%.
Various algorithms are then used to predict the categories of the Twitter short texts. In addition to SVMCNN and CNN, this paper also uses other text classification methods. All of these models are trained on the sentiment polarity datasets and then make predictions on the Twitter datasets. Since the Twitter datasets are unlabeled, this paper determines the real categories manually. The real categories are compared with the models' predictions, and the classification results of all models are then computed. The training results on the sentiment polarity datasets are given in Table II, and the prediction results on the Twitter datasets in Table III.
From the results, it can be seen that the models with pre-trained initialization are better than those with random initialization: all three indicators of the former are higher than those of the latter. In addition, the SVMCNN model performs well in all respects. In particular, SVMCNN with pre-trained initialization achieves the highest value on all three indicators: its classification precision on the sentiment polarity datasets is about 92%, and its test precision on Twitter is also close to 90%. The Recall and F1-Measure likewise reflect the good performance of the SVMCNN model. SVMCNN achieves such results because it makes full use of the advantages of CNN and SVM: it can handle interactions between nonlinear features without relying on all the data. Apart from this, SVMCNN has high generalization ability.
TABLE II
THE RESULTS ON THE SENTIMENT POLARITY DATASETS

              Precision              Recall               F1-Measure
Model     Random   Pre-trained  Random   Pre-trained  Random   Pre-trained
SVMCNN    91.89%   92.11%       85.00%   87.50%       88.30%   89.70%
CNN       88.89%   91.70%       82.00%   82.50%       85.30%   86.86%
SVM           86.20%                67.65%                75.76%
NB            87.50%                52.50%                65.25%
RNN           88.93%                84.10%                86.45%
LSTM          90.79%                85.32%                87.97%
TABLE III
THE RESULTS ON THE TWITTER DATASETS

              Precision              Recall               F1-Measure
Model     Random   Pre-trained  Random   Pre-trained  Random   Pre-trained
SVMCNN    87.19%   88.32%       80.03%   82.63%       83.46%   85.38%
CNN       79.58%   81.40%       74.32%   77.30%       76.86%   79.30%
SVM           76.30%                59.50%                66.86%
NB            75.21%                51.34%                61.02%
RNN           80.42%                73.53%                76.82%
LSTM          84.37%                80.10%                82.18%
The model obtains good results even though the scenarios of the two datasets are very different, and with just a little fine-tuning, the SVMCNN model can adapt to multiple scenarios.
V. CONCLUSIONS
This paper addresses the problem that SVM is easily affected by the dataset in short text classification, and proposes combining CNN with SVM to improve the classification performance. In tests on Twitter users' comments, the results show that SVMCNN with pre-trained initialization performs better than the other algorithms. SVMCNN can play a role in public opinion analysis and sensitive information identification on online social platforms, which helps to guide and maintain a safe and clean network environment.
REFERENCES
[1] M. Nii, K. Takahama, A. Uchinuno, and R. Sakashita, "Soft class decision for nursing-care text classification using a k-nearest neighbor based system", in IEEE International Conference on Fuzzy Systems, 2014, pp. 1825-1830.
[2] J.D.M. Rennie, L. Shih, J. Teevan, and D.R. Karger, "Tackling the poor assumptions of naive Bayes text classifiers", in Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), 2003, pp. 616-623.
[3] D.M. Diab and K.M.E. Hindi, "Using differential evolution for fine tuning naïve Bayesian classifiers and its application for text classification", Applied Soft Computing, vol. 54, 2017, pp. 183-199.
[4] S. Xu, "Bayesian naive Bayes classifiers to text classification", Journal of Information Science, vol. 44, no. 1, 2018, pp. 48-59.
[5] J. Lilleberg, Y. Zhu, and Y. Zhang, "Support vector machines and word2vec for text classification with semantic features", in 14th IEEE International Conference on Cognitive Informatics & Cognitive Computing, 2015, pp. 136-140.
[6] H. Xia, M. Tao, and Y. Wang, "Sentiment text classification of customers reviews on the web based on SVM", in Sixth International Conference on Natural Computation, 2010, pp. 3633-3637.
[7] S. Wang and C.D. Manning, "Baselines and bigrams: Simple, good sentiment and topic classification", in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, vol. 2, 2012, pp. 90-94.
[8] Y. Bengio, R. Ducharme, P. Vincent, et al., "A neural probabilistic language model", Journal of Machine Learning Research, vol. 3, 2003, pp. 1137-1155.
[9] G.E. Hinton, "Learning distributed representations of concepts", in Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 1986, pp. 1-12.
[10] G.E. Hinton and R.R. Salakhutdinov, "Reducing the dimensionality of data with neural networks", Science, vol. 313, no. 5786, 2006, pp. 504-507.
[11] Y. LeCun, Y. Bengio, and G.E. Hinton, "Deep learning", Nature, vol. 521, 2015, pp. 436-444.
[12] Y. Kim, "Convolutional neural networks for sentence classification", in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1746-1751.
[13] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences", in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1, 2014, pp. 655-665.
[14] T. Zhang, C. Li, N. Cao, et al., "Text feature extraction and classification based on convolutional neural network (CNN)", in Data Science, 2017, pp. 472-485.
[15] A. Conneau, H. Schwenk, L. Barrault, et al., "Very deep convolutional networks for text classification", in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, vol. 1, 2017, pp. 1107-1116.
[16] S. Lai, L. Xu, K. Liu, and J. Zhao, "Recurrent convolutional neural networks for text classification", in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pp. 2267-2273.
[17] J.Y. Lee and F. Dernoncourt, "Sequential short-text classification with recurrent and convolutional neural networks", in The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 515-520.