
Estimating Age on Twitter Using Self-Training Semi-Supervised SVM

May 2016
Journal of Robotics Networking and Artificial Life 3(1):24


DOI:10.2991/jrnal.2016.3.1.6

License
CC BY-NC

Authors:

Satoshi Endo

University of the Ryukyus

Koji Yamada

University of the Ryukyus

N. Toma

University of the Ryukyus


Estimating Age on Twitter Using Self-Training Semi-Supervised SVM

Tatsuyuki Iju

Graduate School of Information Engineering, University of the Ryukyus

1 Senbaru, Nishihara-cho, Nakagami-gun, Okinawa, Japan

Satoshi Endo, Koji Yamada, Naruaki Toma, Yuhei Akamine

School of Information Engineering, University of the Ryukyus

1 Senbaru, Nishihara-cho, Nakagami-gun, Okinawa, Japan

E-mail: k148582@ie.u-ryukyu.ac.jp Endo@ie.u-ryukyu.ac.jp

http://www.u-ryukyu.ac.jp/en/

Abstract

Methods for estimating the attributes of Twitter users typically require a vast amount of labeled data. An efficient alternative is to label unlabeled data automatically and add it to the training set. We applied the self-training SVM as a semi-supervised method for age estimation and introduced Platt scaling as the criterion for selecting unlabeled data in the self-training process. We show how the performance of the self-training SVM varies with the amount of training data and the value of the selection criterion.

Keywords: Twitter, Age, Semi-supervised learning, Self-training, SVM, Platt scaling

1. Introduction

Twitter is now widely used as a sensor of social activity. Such analyses are more effective when they account for differences in user attributes such as age and gender, but users rarely disclose personal information publicly. A variety of methods for estimating the attributes of Twitter users have therefore been studied [1][2][3]. However, these methods require a vast amount of labeled data. Because collecting labeled data is typically costly, estimation becomes more efficient if unlabeled data can be labeled automatically and used as additional training data. We investigate a method for building a classifier with a self-training SVM, which combines self-training, a semi-supervised method, with an SVM. In this study, we formulate age estimation as a binary classification problem in which each user is labeled as under 30 or over 30. Users are vectorized with a simple bag-of-words model, and all tweets used as data are Japanese. In Section 2, we describe the self-training SVM. In Section 3, we describe the experiments and present and analyze the results. Finally, we summarize our findings and discuss extensions of this work.

2. Self-training SVM algorithm

This section describes how we build our classifier with a self-training SVM. Self-training is a simple semi-supervised learning algorithm whose earliest applications date back to Scudder [4]. The standard approach to self-training is as follows.

i. Using the underlying learning algorithm, train a classifier on the labeled data set.

ii. Label part of the unlabeled data set with the classifier, then retrain the classifier with the newly labeled data as additional training data.

Journal of Robotics, Networking and Artificial Life, Vol. 3, No. 1 (June 2016), 24-27

Published by Atlantis Press

We construct our classifier for estimating the age of Twitter users with a self-training SVM, in which the underlying learning algorithm is an SVM. Furthermore, we introduce Platt scaling as a criterion for selecting which users in the unlabeled set should be labeled. This criterion is expected to filter out poor-quality data. Platt scaling [5] is a method for modeling a function that returns, from the classifier's decision function, the posterior probability $P(c \mid x)$ that user $x$ belongs to class $c$.
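For reference, Platt's method [5] fits a sigmoid to the SVM decision value. A minimal statement for the two-class case is given below. The symbols are notation assumed here rather than taken from the text: $f(x)$ is the decision value for user $x$, $y \in \{+1, -1\}$ is the class, and $A$ and $B$ are scalars fitted by maximum likelihood on decision values from labeled data.

```latex
P(y = +1 \mid x) = \frac{1}{1 + \exp\bigl(A\, f(x) + B\bigr)},
\qquad
P(y = -1 \mid x) = 1 - P(y = +1 \mid x)
```

Because the two probabilities sum to one, the larger of them is always at least 0.5. A selection criterion of 0.5 therefore admits every unlabeled user, which is consistent with the labeling rate of 1.0 reported for that criterion in Table 5.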

The steps of the self-training SVM algorithm are as follows, where the samples $\{x_i, i = 1, \dots, n\}$ belong to the training set $L$ and the samples $\{u_j, j = 1, \dots, m\}$ belong to the unlabeled set $U$.

i. Using $L$, train an SVM and obtain by Platt scaling the probabilities $\{P(c \mid u_j), j = 1, \dots, m\}$ that sample $u_j$ belongs to class $c$.

ii. Define $S$ as the set of all $u_j$ for which at least one $P(c \mid u_j)$ is at or above the selection criterion $\theta$. Denote a sample in $S$ by $s$.

iii. Define the new $L$ as $L = L + S$, where the label of each $s$ is predicted as the class $c$ for which $P(c \mid s)$ is highest. Furthermore, define the new $U$ as $U = U \setminus S$.

iv. Repeat steps i to iii until $S$ can no longer be defined.
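The four steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `self_train` and `fit_centroid` are hypothetical, the data are one-dimensional toy values, and `fit_centroid` is a deliberately simple stand-in for "SVM plus Platt scaling" so that the example runs with the standard library alone. Any trainer that returns a function from a sample to a list of class probabilities could be passed in its place.

```python
import math
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
ProbaFn = Callable[[Sample], List[float]]              # sample -> class probabilities
Trainer = Callable[[List[Sample], List[int]], ProbaFn]


def self_train(train_x: List[Sample], train_y: List[int], unlabeled: List[Sample],
               fit: Trainer, threshold: float, max_rounds: int = 100
               ) -> Tuple[ProbaFn, List[Sample], List[int], List[Sample]]:
    """Grow the labeled set with confidently pseudo-labeled samples."""
    labeled_x, labeled_y = list(train_x), list(train_y)
    pool = list(unlabeled)
    model = fit(labeled_x, labeled_y)                  # step i
    for _ in range(max_rounds):
        selected, remaining = [], []
        for u in pool:
            probs = model(u)
            best = max(range(len(probs)), key=probs.__getitem__)
            if probs[best] >= threshold:               # step ii: passes the criterion
                selected.append((u, best))
            else:
                remaining.append(u)
        if not selected:                               # step iv: nothing left to add
            break
        for u, label in selected:                      # step iii: move S from U to L
            labeled_x.append(u)
            labeled_y.append(label)
        pool = remaining
        model = fit(labeled_x, labeled_y)              # retrain, back to step i
    return model, labeled_x, labeled_y, pool


def fit_centroid(xs: List[Sample], ys: List[int]) -> ProbaFn:
    """Hypothetical stand-in for 'SVM + Platt scaling' on 1-D, two-class data."""
    c0 = [x[0] for x, y in zip(xs, ys) if y == 0]
    c1 = [x[0] for x, y in zip(xs, ys) if y == 1]
    m0, m1 = sum(c0) / len(c0), sum(c1) / len(c1)

    def predict_proba(x: Sample) -> List[float]:
        # A positive score means x is closer to the class-1 centroid.
        score = abs(x[0] - m0) - abs(x[0] - m1)
        p1 = 1.0 / (1.0 + math.exp(-score))
        return [1.0 - p1, p1]

    return predict_proba


_, xs, ys, pool = self_train([[0.0], [4.0]], [0, 1],
                             [[-1.0], [5.0], [2.1]], fit_centroid, threshold=0.9)
print(ys)    # [0, 1, 0, 1]
print(pool)  # [[2.1]]
```

In this toy run the two unlabeled points far from the class boundary are labeled in the first round, while the point near the boundary never reaches the 0.9 criterion and stays in the unlabeled pool. Lowering the criterion to 0.5 would label it as well, at a higher risk of a wrong label.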

3. Experiments

3.1. Experimental contents and settings

We defined age as two classes, under 30 and over 30, following the definition used by Rao et al. [1]. We carried out experiments to evaluate the performance of the self-training SVM for estimating the age of Twitter users. Using the self-training SVM, we built a classifier for each of nine configurations: three training sets of different sizes (76, 256, and 376 users) combined with three selection criteria (0.5, 0.7, and 0.9). We then measured each classifier's performance on the test set. To provide baselines for these classifiers, we also built classifiers with a normal SVM for each of the three training sets and measured their performance in the same way. Performance was measured by five-fold cross-validation. The test set contained 480 users and the unlabeled set contained 1200 users. The numbers of users in the under 30 and over 30 classes were balanced in both the training and test sets. Users were vectorized with a simple bag-of-words model, both when training the classifier and when predicting a user's age. Before running the experiments, we had to set the hyperparameters of the SVM and the features of the bag-of-words representation. We therefore performed a grid search and set the SVM kernel to linear and the cost parameter to 1000. As features, we used the 158 top-ranked words by score among the words that appeared in the tweets of users in the training set.
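A bag-of-words vectorization over a fixed vocabulary can be sketched as follows. This is only an illustration under several assumptions: the names `build_vocabulary` and `vectorize` are hypothetical, tweets are assumed to be tokenized already (Japanese text would first need a word segmenter), and the vocabulary is ranked by raw frequency purely as a stand-in, since the example does not try to reproduce the score used in the experiments. English tokens are used for readability.

```python
from collections import Counter
from typing import List, Sequence


def build_vocabulary(users_tokens: Sequence[Sequence[str]], size: int) -> List[str]:
    """Return the `size` top-ranked tokens (stand-in ranking: raw frequency)."""
    counts = Counter(tok for tokens in users_tokens for tok in tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:size]


def vectorize(tokens: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Count how often each vocabulary word occurs in one user's tokens."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]   # Counter gives 0 for absent words


# Toy training users, one token list per user.
training_users = [
    ["school", "exam", "exam"],
    ["work", "meeting", "exam"],
]
vocab = build_vocabulary(training_users, size=2)
print(vocab)                                       # ['exam', 'meeting']
print(vectorize(["exam", "exam", "lunch"], vocab)) # [2, 0]
```

The vocabulary is built from training users only and then reused unchanged for test and unlabeled users, so words outside the vocabulary (here "lunch") are simply ignored.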

3.2. Results

First, we describe the results for the normal SVMs. Table 1 shows the performance of the normal SVM classifiers. Table 1 confirms that the classifiers improve as the training set grows, at least for training set sizes between 76 and 376. From this result, we expect that the self-training SVM can likewise improve performance by utilizing the unlabeled set when the training set size is within the same range. Additionally, regardless of the training set size, precision was low and recall was high for the under 30 class, whereas precision was high and recall was low for the over 30 class. The F measure was better for the under 30 class than for the over 30 class. In short, users in the under 30 class were easier to predict than users in the over 30 class.

As for the self-training SVM, Tables 2, 3, and 4 show the prediction results of the classifiers built from the training sets of 76, 256, and 376 users, respectively. The classifier with the largest improvement over the baseline was the one built from the 76-user training set with a selection criterion of 0.9 (an improvement of 0.032 points in recall for the over 30 class). The classifier built from the 376-user training set with a selection criterion of 0.9 showed the second-largest improvement (0.025 points in recall for the over 30 class). Although the recall improvement for the over 30 class is notable, precision for the over 30 class and both precision and recall for the under 30 class rise or fall only slightly, so the mean F measure improves only marginally.

Table 5 shows the labeling accuracy of self-training with the 76-user training set, together with the ratio of labeled users to the entire unlabeled set. With a selection criterion of 0.5, precision and F measure for the over 30 class, and hence the mean F measure, were not defined, because the entire unlabeled set passed the filter and was eventually labeled as the under 30 class. Labeling accuracy was better for higher selection criteria, whereas the amount of data labeled from the unlabeled set decreased as the criterion increased. However, as Tables 2 through 4 indicate, performance was better with a selection criterion of 0.5 than with 0.7 for all classifiers except the one built from the 376-user training set, although a selection criterion of 0.9 performed best for all classifiers. This result is inconsistent with the fact that labeling errors are more frequent with a selection criterion of 0.5 than with 0.7, as indicated in Table 5. It implies that users selected by self-training with a criterion of 0.7 are easier to predict than users selected with a criterion of 0.5, since the classifier trained with a criterion of 0.7 is more strongly affected by labeling errors than the one trained with 0.5. In addition, we infer that, with a selection criterion of 0.9, the labeling error rate is small enough for the classifier to improve its performance.
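The measures discussed above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names `class_metrics` and `mean_f` and the toy labels are hypothetical, and it assumes that the mean F measure is the unweighted average of the two per-class F measures, which is consistent with the values reported in Tables 1 to 4. A measure is returned as `None` when its denominator is zero, which is the situation described for the over 30 class when every user is labeled as under 30.

```python
from typing import Dict, List, Optional, Sequence


def class_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                  label: str) -> Dict[str, Optional[float]]:
    """Precision, recall, and F measure for one class; None where undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)   # TP + FP
    actual = sum(1 for t in y_true if t == label)      # TP + FN
    precision = tp / predicted if predicted else None
    recall = tp / actual if actual else None
    if precision is None or recall is None or precision + recall == 0:
        f = None
    else:
        f = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f": f}


def mean_f(y_true: Sequence[str], y_pred: Sequence[str],
           labels: Sequence[str]) -> Optional[float]:
    """Unweighted mean of the per-class F measures; None if any is undefined."""
    fs: List[Optional[float]] = [class_metrics(y_true, y_pred, lab)["f"] for lab in labels]
    if any(f is None for f in fs):
        return None
    return sum(fs) / len(fs)


truth = ["under", "under", "over", "over"]
all_under = ["under", "under", "under", "under"]   # every user labeled under 30

print(class_metrics(truth, all_under, "over"))
# {'precision': None, 'recall': 0.0, 'f': None}
print(mean_f(truth, all_under, ["under", "over"]))
# None
```

When no user is predicted as over 30, the precision for that class has a zero denominator, so the F measure and the mean F measure are undefined as well, while recall is still defined and equals zero.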

Table 1: Results for the normal SVM

| Users in training set | Age class | Precision | Recall | F measure | Mean F measure |
|---|---|---|---|---|---|
| 76 | Under 30 | 0.613 | 0.882 | 0.723 | 0.643 |
| 76 | Over 30 | 0.792 | 0.441 | 0.563 | 0.643 |
| 256 | Under 30 | 0.638 | 0.896 | 0.745 | 0.680 |
| 256 | Over 30 | 0.825 | 0.492 | 0.616 | 0.680 |
| 376 | Under 30 | 0.647 | 0.898 | 0.752 | 0.692 |
| 376 | Over 30 | 0.834 | 0.509 | 0.632 | 0.692 |

Table 4: Results for the training set with 376 users

| Selection criterion | Age class | Precision | Recall | F measure | Mean F measure |
|---|---|---|---|---|---|
| 0.5 | Under 30 | 0.647 | 0.898 | 0.752 | 0.692 |
| 0.5 | Over 30 | 0.834 | 0.510 | 0.632 | 0.692 |
| 0.7 | Under 30 | 0.649 | 0.891 | 0.750 | 0.692 |
| 0.7 | Over 30 | 0.825 | 0.516 | 0.634 | 0.692 |
| 0.9 | Under 30 | 0.657 | 0.890 | 0.756 | 0.703 |
| 0.9 | Over 30 | 0.829 | 0.534 | 0.650 | 0.703 |

Table 3: Results for the training set with 256 users

| Selection criterion | Age class | Precision | Recall | F measure | Mean F measure |
|---|---|---|---|---|---|
| 0.5 | Under 30 | 0.638 | 0.896 | 0.745 | 0.680 |
| 0.5 | Over 30 | 0.825 | 0.492 | 0.616 | 0.680 |
| 0.7 | Under 30 | 0.630 | 0.895 | 0.739 | 0.669 |
| 0.7 | Over 30 | 0.818 | 0.473 | 0.599 | 0.669 |
| 0.9 | Under 30 | 0.645 | 0.876 | 0.742 | 0.686 |
| 0.9 | Over 30 | 0.808 | 0.512 | 0.629 | 0.686 |

Table 2: Results for the training set with 76 users

| Selection criterion | Age class | Precision | Recall | F measure | Mean F measure |
|---|---|---|---|---|---|
| 0.5 | Under 30 | 0.614 | 0.882 | 0.723 | 0.643 |
| 0.5 | Over 30 | 0.792 | 0.442 | 0.563 | 0.643 |
| 0.7 | Under 30 | 0.610 | 0.888 | 0.722 | 0.639 |
| 0.7 | Over 30 | 0.793 | 0.430 | 0.554 | 0.639 |
| 0.9 | Under 30 | 0.625 | 0.901 | 0.728 | 0.659 |
| 0.9 | Over 30 | 0.791 | 0.473 | 0.590 | 0.659 |


4. Conclusions

To evaluate the self-training SVM for estimating the age of Twitter users, we constructed a classifier for each of nine configurations (three training sets of 76, 256, and 376 users, each combined with selection criteria of 0.5, 0.7, and 0.9). We then evaluated the classifiers on the test set. As a result, with a selection criterion of 0.9, recall for the over 30 class improved over the baseline by 0.032 points for the 76-user training set and by 0.025 points for the 376-user training set. In future work, we will investigate the relation between the selection criterion and the performance of the self-training SVM and explore ways to improve them.

References

1. Rao, Delip, et al. "Classifying latent user attributes in twitter." Proceedings of the 2nd international workshop on Search and mining user-generated contents. ACM, 2010.

2. Burger, John D., et al. "Discriminating gender on Twitter." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.

3. Pennacchiotti, Marco, and Ana-Maria Popescu. "Democrats, republicans and starbucks afficionados: user classification in twitter." Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2011.

4. Scudder, Henry J. "Probability of error of some adaptive pattern-recognition machines." Information Theory, IEEE Transactions on 11.3 (1965): 363-371.

5. Platt, John. "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods." Advances in large margin classifiers 10.3 (1999): 61-74.

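Table 5: Labeling results for the training set with 76 users

| Selection criterion | Age class | Precision | Recall | F measure | Mean F measure | Labeling rate |
|---|---|---|---|---|---|---|
| 0.5 | Under 30 | 0.621 | 1.0 | 0.766 | n/a | 1.0 |
| 0.5 | Over 30 | n/a | n/a | n/a | n/a | 1.0 |
| 0.7 | Under 30 | 0.722 | 0.890 | 0.796 | 0.660 | 0.97 |
| 0.7 | Over 30 | 0.699 | 0.425 | 0.524 | 0.660 | 0.97 |
| 0.9 | Under 30 | 0.752 | 0.90 | 0.820 | 0.678 | 0.83 |
| 0.9 | Over 30 | 0.705 | 0.440 | 0.537 | 0.678 | 0.83 |

n/a: no value given in the table.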
