
Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms

... For training, positive examples are users who buy at least one item in the category, and an equal number of random negative examples are provided. During testing, for each user in the test set, the SVM returns a confidence score [13] which we use for ranking. We use SVMlight with a radial basis function (RBF) kernel [3]. ...
Article
Full-text available
Nowadays, many e-commerce websites allow users to log in with their existing social networking accounts. When a new user comes to an e-commerce website, it is interesting to study whether information from external social media platforms can be utilized to alleviate the cold-start problem. In this paper, we focus on a specific task in cross-site information sharing, i.e., leveraging the text posted by a user on a social media platform (termed social text) to infer his/her purchase preference for product categories on an e-commerce platform. To solve the task, a key problem is how to effectively represent the social text in a way that its information can be utilized on the e-commerce platform. We study two major kinds of text representation methods for predicting cross-site purchase preference: shallow textual features and deep textual features learned by deep neural network models. We conduct extensive experiments on a large linked dataset, and our experimental results indicate that it is promising to utilize social text for predicting purchase preference. In particular, the deep neural network approach shows more powerful predictive ability when the number of categories becomes large.
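The snippet above ranks users by an SVM confidence score under an RBF kernel. As a minimal sketch (toy data and hand-set dual coefficients, not SVMlight or the authors' features), this is how a trained RBF-kernel SVM turns a test point into a signed-margin score usable for ranking:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2), the radial basis function kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_score(x, support_vectors, alphas, labels, b=0.0, gamma=0.5):
    """Signed-margin confidence score f(x) = sum_i alpha_i * y_i * K(sv_i, x) + b."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b

# Toy "trained" model: one support vector per class, hand-set coefficients.
svs = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [1.0, 1.0]
labels = [1, -1]

# Rank test users by confidence score, highest (most likely buyer) first.
users = [(0.9, 1.1), (-1.0, -0.9), (0.1, 0.0)]
scores = [svm_score(u, svs, alphas, labels) for u in users]
ranking = sorted(range(len(users)), key=lambda i: -scores[i])
```

The magnitude of the score is not a calibrated probability, but its ordering is exactly what a ranking task needs.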
... For more details on SVM, the reader is referred to Cristianini and Shawe-Taylor's tutorial [Cristianini and Shawe-Taylor 2000] and Roberto Basili's paper [Basili 2003] ...
Article
This paper presents a generative model based on the language modeling approach for sentiment analysis. By characterizing the semantic orientation of documents as "favorable" (positive) or "unfavorable" (negative), this method captures the subtle information needed in text retrieval. A language model based method is proposed to capture the dependency between a "term" and the ordinary words around it in the context of a triggered language model: first, a batch of terms in a domain is identified; second, two different language models representing classifying knowledge are built for every term from subjective sentences; last, a classifying function based on the generation of a test document is defined for sentiment analysis. When compared with the Support Vector Machine, a popular discriminative model, the language modeling approach performs better on a Chinese digital product review corpus under 3-fold cross-validation. This result motivates the search for more suitable language models for sentiment detection in future research.
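The generative idea can be illustrated without the paper's triggered language models: train one unigram model per class and label a test document by which class model generates it with higher smoothed log-likelihood. A minimal sketch on a made-up toy corpus:

```python
import math
from collections import Counter

def train_lm(docs):
    """Unigram counts and total token count for one class."""
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

def log_prob(doc, model, vocab_size):
    counts, total = model
    # Add-one (Laplace) smoothing keeps unseen words at nonzero probability.
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in doc.split())

pos_docs = ["great screen sharp photos", "battery life is great"]
neg_docs = ["poor battery terrible screen", "photos look terrible"]
vocab = {w for d in pos_docs + neg_docs for w in d.split()}

pos_lm, neg_lm = train_lm(pos_docs), train_lm(neg_docs)

def classify(doc):
    """Label by the class model with the higher generation log-likelihood."""
    lp = log_prob(doc, pos_lm, len(vocab))
    ln = log_prob(doc, neg_lm, len(vocab))
    return "favorable" if lp > ln else "unfavorable"
```

A real system would model the term-context dependencies the paper describes; the unigram version only shows the generative classification rule itself.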
... Machine learning has many applications in real life. It is routinely used in banking (for detecting fraudulent transactions (Dorronsoro et al., 1997)), in finance (to predict stock prices (Huang et al., 2005a)), in marketing (to reveal patterns of consumer spending (Bose and Mahapatra, 2001)), and on the Internet (as part of search engines (Basili, 2003)). In biomedicine, MYCIN was proposed in the early 1970s at Stanford University. ...
Article
In this paper, we give a short introduction to machine learning and survey its applications in radiology. We focus on six categories of applications in radiology: medical image segmentation, registration, computer-aided detection and diagnosis, brain function or activity analysis and neurological disease diagnosis from fMR images, content-based image retrieval systems for CT or MRI images, and text analysis of radiology reports using natural language processing (NLP) and natural language understanding (NLU). This survey shows that machine learning plays a key role in many radiology applications. Machine learning identifies complex patterns automatically and helps radiologists make intelligent decisions on radiology data such as conventional radiographs, CT, MRI, and PET images and radiology reports. In many applications, the performance of machine learning-based automatic detection and diagnosis systems has been shown to be comparable to that of a well-trained and experienced radiologist. Technology development in machine learning and radiology will benefit from each other in the long run. Key contributions and common characteristics of machine learning techniques in radiology are discussed. We also discuss the problem of translating machine learning applications to the radiology clinical setting, including advantages and potential barriers.
Article
The emerging world-wide e-society creates new ways of interaction between people with different cultures and backgrounds. Communication systems such as forums, blogs, and comments are easily accessible to end users. In this context, managing user-generated content has proven to be a difficult but necessary task. Studying and interpreting user-generated data/text available on the Internet is a complex and time-consuming task for any human analyst. This study proposes an interdisciplinary approach to modelling the flaming phenomenon (hot, aggressive discussions) in online Italian forums. The model is based on the analysis of psycho/cognitive/linguistic interaction modalities among web communities' participants, state-of-the-art machine learning techniques, and natural language processing technology. Virtual communities' administrators, moderators, and users could benefit directly from this research. A further positive outcome of this research is the opportunity to better understand and model the dynamics of web forums as the basis for developing opinion mining applications focused on commercial uses.
Conference Paper
Using a set of binary classifiers to solve the multiclass classification problem has been a popular approach over the years. This technique is known as binarization. The decision boundary that these binary classifiers (also called base classifiers) have to learn is much simpler than the decision boundary of a multiclass classifier. But binarization gives rise to a new problem called the class imbalance problem. The class imbalance problem occurs when the data set used for training has relatively fewer data items for one class than for another. This problem becomes more severe if the original data set itself was imbalanced. Furthermore, binarization has so far only been implemented in the domain of supervised classification. In this paper, we propose a framework called Binarization with Boosting and Oversampling (BBO). Our framework can handle the class imbalance problem arising from binarization. As the name suggests, this is achieved through a combination of boosting and oversampling. The BBO framework can be used with any supervised classification algorithm. Moreover, unlike earlier binarization approaches, we apply our framework to semi-supervised classification as well. The BBO framework has been rigorously tested with a number of benchmark data sets from the UCI machine learning repository. The experimental results show that using the BBO framework achieves higher accuracy than the traditional binarization approach.
Article
Using a set of binary classifiers to solve multiclass classification problems has been a popular approach over the years. The decision boundaries learnt by binary classifiers (also called base classifiers) are much simpler than those learnt by multiclass classifiers. This paper proposes a new classification framework, termed binarization with boosting and oversampling (BBO), for efficiently solving multiclass classification problems. The new framework is devised based on the one-versus-all (OVA) binarization technique. Unlike most previous work, BBO employs boosting for solving the hard-to-learn instances and oversampling for handling the class-imbalance problem arising from OVA binarization. These two features distinguish BBO from existing work. Our new framework has been tested extensively on several multiclass supervised and semi-supervised classification problems using several different base classifiers, including neural networks, C4.5, k-nearest neighbor, repeated incremental pruning to produce error reduction (RIPPER), support vector machines, random forest, and learning with local and global consistency. Experimental results show that BBO can exhibit better performance than its counterparts on supervised and semi-supervised classification problems.
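The imbalance that OVA binarization creates is easy to see: each class supplies the positives while all remaining classes pool into the negatives. A minimal sketch (toy data; boosting and the base learner are abstracted away) of the OVA split plus random oversampling of the positive side:

```python
import random

def ova_with_oversampling(X, y, seed=0):
    """Build one balanced binary training set per class (one-versus-all).

    For each class, positives are that class's examples and negatives are
    everything else; the positive side is randomly oversampled (duplicated)
    until the two sides are the same size.
    """
    rng = random.Random(seed)
    binary_sets = {}
    for cls in sorted(set(y)):
        pos = [x for x, label in zip(X, y) if label == cls]
        neg = [x for x, label in zip(X, y) if label != cls]
        while len(pos) < len(neg):
            pos.append(rng.choice(pos))
        binary_sets[cls] = (pos, neg)
    return binary_sets

# Three classes, two examples each: every OVA split starts 2-vs-4.
X = [(0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (9.0, 0.1), (9.1, 0.0)]
y = ["a", "a", "b", "b", "c", "c"]
sets = ova_with_oversampling(X, y)
```

In the full BBO framework the base classifier is then boosted on each balanced set; here only the data-preparation step is shown.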
Article
In state-of-the-art region of interest (ROI) based image retrieval systems, a user-defined ROI query is considered to reflect the user's intention more effectively than an ROI query automatically selected by the system. Compared with existing image retrieval methods, user-defined ROI based image retrieval has two obvious characteristics: first, the target region is located at the center of the ROI query, and second, the ROI query contains hardly any noisy descriptors that do not belong to the target region. Based on these two characteristics and the general bag-of-words image retrieval method, an auxiliary Gaussian weighting (AGW) scheme is incorporated into our ROI based image retrieval system. Each descriptor is weighted according to its distance from the center of the ROI query, using a 2-d Gaussian window function. The AGW scheme is used to compute the score of each image in the database. Meanwhile, an efficient re-ranking algorithm is proposed based on the distribution consistency of the Gaussian weight between the matched descriptors of the ROI query and the candidate image, referred to as DCGW re-ranking. The experimental results demonstrate that our system obtains satisfactory retrieval results.
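The weighting step itself is compact. A minimal sketch (hypothetical descriptor coordinates and sigma, not the paper's parameters) of a 2-d Gaussian window centered on the ROI, so descriptors near the center (the likely target region) count more than those near the border:

```python
import math

def agw_weight(pt, center, sigma):
    """2-d Gaussian window weight for a descriptor located at pt."""
    dx, dy = pt[0] - center[0], pt[1] - center[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

roi_center = (50.0, 50.0)
# Toy descriptor positions: at the center, mid-distance, and near a corner.
descriptors = [(50.0, 50.0), (60.0, 50.0), (90.0, 90.0)]
weights = [agw_weight(p, roi_center, sigma=25.0) for p in descriptors]
```

In the full system these weights feed into the bag-of-words scoring of every database image, and their distribution over matched descriptors drives the DCGW re-ranking.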
Conference Paper
Full-text available
We propose a method for estimating user activities by analyzing long-term (more than several seconds) acoustic signals represented as acoustic event temporal sequences. The proposed method is based on a probabilistic generative model of an acoustic event temporal sequence that is associated with user activities (e.g. "cooking") and subordinate categories of user activities (e.g. "fry ingredients" or "plate food") in which each user activity is represented as a probability distribution over unsupervised subordinate categories of user activities called activity-topics, and each activity-topic is represented as a probability distribution over acoustic events. This probabilistic generative model can express user activities that have more than one subordinate category of the user activities, which a model that takes into account only user activities cannot express adequately. User activity estimation with this model is achieved using a two-step process: frame-by-frame acoustic event estimation to output an acoustic event temporal sequence and user activity estimation with the proposed probabilistic generative model. Activity estimation experiments with real-life sounds indicated that the proposed method improved user activity estimation accuracy and stability of "unseen" acoustic event temporal sequences. In addition, the experiment showed that the proposed method could extract correct subordinate categories of user activities.
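The generative story can be sketched directly: an activity is a distribution over latent activity-topics, and each topic is a distribution over acoustic events, so an observed event sequence is drawn topic-by-topic. A minimal sketch with made-up distributions (inference over the latent topics, the hard part, is omitted):

```python
import random

def sample_event_sequence(activity, topics, length, seed=0):
    """Draw an acoustic event sequence from the two-level generative model.

    activity: {topic: prob} -- the activity's distribution over activity-topics.
    topics:   {topic: {event: prob}} -- each topic's distribution over events.
    """
    rng = random.Random(seed)
    sequence = []
    for _ in range(length):
        topic = rng.choices(list(activity), weights=list(activity.values()))[0]
        events = topics[topic]
        sequence.append(rng.choices(list(events), weights=list(events.values()))[0])
    return sequence

# Hypothetical "cooking" activity with two activity-topics.
cooking = {"fry": 0.6, "plate": 0.4}
topics = {
    "fry":   {"sizzle": 0.7, "clank": 0.3},
    "plate": {"clank": 0.5, "scrape": 0.5},
}
seq = sample_event_sequence(cooking, topics, length=10)
```

Because topics are shared structure, the same model can describe activities with several subordinate categories, which a flat activity-to-event model cannot.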
Conference Paper
We propose a model for analyzing acoustic scenes by using long-term (more than several seconds) acoustic signals based on a probabilistic generative model of an acoustic feature sequence associated with acoustic scenes (e.g. "cooking") and acoustic events (e.g. "cutting with a knife," "heating a skillet" or "running water") called the latent acoustic topic and event allocation (LATEA) model. The proposed model allows the analysis of a wide variety of sounds and the capture of abstract acoustic scenes by representing acoustic events and scenes as latent variables, and can also describe the acoustic similarity and variance between acoustic events by representing acoustic features as a mixture of Gaussian components. Experiments with real-life sounds indicated that the proposed model exhibited lower perplexity than conventional models; it improved the stability of acoustic scene estimation. The experimental results also suggested that the proposed model can better describe the acoustic similarity and variance between acoustic events than conventional models.
Conference Paper
Full-text available
In this paper, we describe a multi-class text categorization approach to authorship attribution and test it on sets of e-mail collections. The PAN 2011 competition data consists of e-mails of variable length, written by various candidate authors, with some authors represented by significantly longer or more numerous e-mails than others. Rather than constructing a classifier for each separate author to discriminate it from the others (i.e. binary classification), we adopt a multi-class scheme in which all authorship classes are learned simultaneously. We explore the effect of the selection of feature types and of the C parameter in the SVMmulticlass learning algorithm. Variable-length lexical features showed promising results; nevertheless, our authorship attribution approach scored only a mid position among the other competitors, for the SMALL as well as the LARGE test sets.
Conference Paper
Full-text available
This note describes a scoring scheme for the coreference task in MUC6. It improves on the original approach by: (1) grounding the scoring scheme in terms of a model; (2) producing more intuitive recall and precision scores; and (3) not requiring explicit computation of the transitive closure of coreference. The principal conceptual difference is that we have moved from a syntactic scoring model based on following coreference links to an approach defined by the model theory of those links.
Conference Paper
Full-text available
In this paper, we apply Support Vector Machines (SVMs) to identify English base phrases (chunks). It is well-known that SVMs achieve high generalization performance even using input data with a high dimensional feature space. Furthermore, by introducing the Kernel principle, SVMs can carry out training with smaller computational cost independent of the dimensionality of the feature space. In order to improve accuracy, we also apply majority voting with 8 SVMs which are trained using distinct chunk representations. Experimental results show that our approach achieves better accuracy than other conventional frameworks.
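The voting step combines the 8 chunkers' per-token predictions by simple majority. A minimal sketch (the votes below are hypothetical IOB chunk tags, not output from trained SVMs):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common prediction; ties fall to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-token votes from 8 chunkers trained on distinct
# chunk representations (IOB-style tags).
votes = ["B-NP", "B-NP", "I-NP", "B-NP", "B-NP", "O", "B-NP", "I-NP"]
tag = majority_vote(votes)
```

Because the 8 chunkers are trained on different chunk representations, their errors tend to differ, which is what makes the vote more accurate than any single model.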
Article
An abstract is not available.
Article
The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
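At the heart of the ID3 synthesis described above is the information-gain criterion: split on the attribute whose partition of the examples most reduces class entropy. A minimal sketch on a made-up toy data set:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Entropy reduction from partitioning the examples on attribute attr."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(ex[attr] for ex in examples):
        subset = [l for ex, l in zip(examples, labels) if ex[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy data: 'outlook' separates the classes perfectly, 'windy' not at all.
examples = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rain",  "windy": False},
    {"outlook": "rain",  "windy": True},
]
labels = ["no", "no", "yes", "yes"]

best = max(["outlook", "windy"], key=lambda a: information_gain(examples, labels, a))
```

ID3 applies this choice recursively, growing the decision tree until each leaf is (near-)pure; the noise-handling refinements the paper discusses modify the stopping and pruning rules, not the gain criterion itself.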
Article
Two approaches to the problem of resolving pronoun references are presented. The first is a naive algorithm that works by traversing the surface parse trees of the sentences of the text in a particular order, looking for noun phrases of the correct gender and number. The algorithm clearly does not work in all cases, but the results of an examination of several hundred examples from published texts show that it performs remarkably well. In the second approach, it is shown how pronoun resolution can be handled in a comprehensive system for semantic analysis of English texts. The system is described, and a detailed treatment of several examples shows how semantic analysis locates the antecedents of most pronouns as a by-product. Included are the classic examples of Winograd and Charniak.
Conference Paper
Expert Network (ExpNet) is our approach to automatic categorization and retrieval of natural language texts. We use a training set of texts with expert-assigned categories to construct a network which approximately reflects the conditional probabilities of categories given a text. The input nodes of the network are words in the training texts, the nodes on the intermediate level are the training texts, and the output nodes are categories. The links between nodes are computed based on statistics of the word distribution and the category distribution over the training set. ExpNet is used for relevance ranking of candidate categories of an arbitrary text in the case of text categorization, and for relevance ranking of documents via categories in the case of text retrieval. We have evaluated ExpNet in categorization and retrieval on a document collection from the MEDLINE database, and observed recall and precision comparable to the Linear Least Squares Fit (LLSF) mapping method, and significantly better than the other methods tested. Computationally, ExpNet has an O(N log N) time complexity, much more efficient than the cubic complexity of the LLSF method. The simplicity of the model, the high recall and precision rates, and the efficient computation together make ExpNet preferable as a practical solution for real-world applications.
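The three-layer idea can be sketched very simply: input words link to the training texts that contain them, and training texts link to their expert-assigned categories, so a new text ranks categories through the training texts it shares words with. A minimal sketch (toy corpus; uniform link weights stand in for ExpNet's word- and category-distribution statistics):

```python
from collections import defaultdict

# Hypothetical training texts with expert-assigned categories.
training = [
    ("heart attack symptoms", "cardiology"),
    ("heart valve surgery", "cardiology"),
    ("skin rash treatment", "dermatology"),
]

def rank_categories(text):
    """Score each category by word overlap routed through the training texts."""
    scores = defaultdict(float)
    words = set(text.split())
    for doc, category in training:
        scores[category] += len(words & set(doc.split()))
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank_categories("heart surgery recovery")
```

The real network weights each word-to-text and text-to-category link statistically rather than uniformly, which is where its approximation of P(category | text) comes from.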
Conference Paper
In this paper, a Bayesian inference network model for automatic indexing with index terms (descriptors) from a prescribed vocabulary is presented. It requires an indexing dictionary with rules mapping terms of the respective subject field onto descriptors and inverted lists for terms occurring in a set of documents of the subject field and descriptors manually assigned to these documents. The indexing dictionary can be derived automatically from a set of manually indexed documents. An application of the network model is described, followed by an indexing example and some experimental results about the indexing performance of the network model.
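The dictionary-driven core of such an indexer can be sketched without the full inference network: each term contributes weighted evidence for the descriptors its dictionary rules map it to, and descriptors are ranked by accumulated evidence. A minimal sketch (the dictionary entries and weights below are hypothetical):

```python
from collections import defaultdict

# Hypothetical indexing dictionary: term -> {descriptor: rule weight},
# as would be derived from a set of manually indexed documents.
rules = {
    "neural":  {"Neural Networks": 0.9},
    "network": {"Neural Networks": 0.6, "Computer Networks": 0.4},
    "router":  {"Computer Networks": 0.9},
}

def rank_descriptors(text):
    """Rank controlled-vocabulary descriptors by summed rule weights."""
    scores = defaultdict(float)
    for term in text.split():
        for descriptor, weight in rules.get(term, {}).items():
            scores[descriptor] += weight
    return sorted(scores, key=scores.get, reverse=True)

ranked = rank_descriptors("neural network training")
```

The paper's network model replaces this additive scoring with proper Bayesian inference over the term-descriptor links, but the flow of evidence from document terms to candidate descriptors is the same.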