
Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms

... For training, positive examples are users who buy at least one item in the category, and an equal number of random negative examples are provided. During testing, for each user in the test set, the SVM returns a confidence score [13] which we use for ranking. We use SVMlight with a radial basis function (RBF) kernel [3]. ...
Article
Full-text available
Nowadays, many e-commerce websites allow users to log in with their existing social networking accounts. When a new user comes to an e-commerce website, it is interesting to study whether information from external social media platforms can be utilized to alleviate the cold-start problem. In this paper, we focus on a specific task in cross-site information sharing, i.e., leveraging the text posted by a user on a social media platform (termed social text) to infer his/her purchase preference for product categories on an e-commerce platform. To solve the task, a key problem is how to effectively represent the social text in a way that its information can be utilized on the e-commerce platform. We study two major kinds of text representation methods for predicting cross-site purchase preference: shallow textual features and deep textual features learned by deep neural network models. We conduct extensive experiments on a large linked dataset, and our experimental results indicate that it is promising to utilize social text for predicting purchase preference. In particular, the deep neural network approach shows more powerful predictive ability when the number of categories becomes large.
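The snippet above ranks users by an SVM confidence score under an RBF kernel. As a minimal sketch (toy data and hand-set dual coefficients, not SVMlight or the authors' features), this is how a trained RBF-kernel SVM turns a test point into a signed-margin score usable for ranking:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2), the radial basis function kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svm_score(x, support_vectors, alphas, labels, b=0.0, gamma=0.5):
    """Signed-margin confidence score f(x) = sum_i alpha_i * y_i * K(sv_i, x) + b."""
    return sum(a * y * rbf_kernel(sv, x, gamma)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b

# Toy "trained" model: one support vector per class, hand-set coefficients.
svs = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [1.0, 1.0]
labels = [1, -1]

# Rank test users by confidence score, highest (most likely buyer) first.
users = [(0.9, 1.1), (-1.0, -0.9), (0.1, 0.0)]
scores = [svm_score(u, svs, alphas, labels) for u in users]
ranking = sorted(range(len(users)), key=lambda i: -scores[i])
```

The magnitude of the score is not a calibrated probability, but its ordering is exactly what a ranking task needs.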
... For more details on SVM, the reader is referred to Cristianini and Shawe-Taylor's tutorial [Cristianini and Shawe-Taylor 2000] and Roberto Basili's paper [Basili 2003] ...
Article
This paper presents a generative model based on the language modeling approach for sentiment analysis. By characterizing the semantic orientation of documents as "favorable" (positive) or "unfavorable" (negative), this method captures the subtle information needed in text retrieval. A language model based method is proposed to capture the dependency between a "term" and the ordinary words around it in the context of a triggered language model: first, a batch of terms in a domain is identified; second, two different language models representing classifying knowledge are built for every term from subjective sentences; last, a classifying function based on the generation of a test document is defined for sentiment analysis. When compared with the Support Vector Machine, a popular discriminative model, the language modeling approach performs better on a Chinese digital product review corpus under 3-fold cross-validation. This result motivates the search for more suitable language models for sentiment detection in future research.
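The generative idea can be illustrated without the paper's triggered language models: train one unigram model per class and label a test document by which class model generates it with higher smoothed log-likelihood. A minimal sketch on a made-up toy corpus:

```python
import math
from collections import Counter

def train_lm(docs):
    """Unigram counts and total token count for one class."""
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

def log_prob(doc, model, vocab_size):
    counts, total = model
    # Add-one (Laplace) smoothing keeps unseen words at nonzero probability.
    return sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in doc.split())

pos_docs = ["great screen sharp photos", "battery life is great"]
neg_docs = ["poor battery terrible screen", "photos look terrible"]
vocab = {w for d in pos_docs + neg_docs for w in d.split()}

pos_lm, neg_lm = train_lm(pos_docs), train_lm(neg_docs)

def classify(doc):
    """Label by the class model with the higher generation log-likelihood."""
    lp = log_prob(doc, pos_lm, len(vocab))
    ln = log_prob(doc, neg_lm, len(vocab))
    return "favorable" if lp > ln else "unfavorable"
```

A real system would model the term-context dependencies the paper describes; the unigram version only shows the generative classification rule itself.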
... Machine learning has many applications in real life. It is routinely used in banking (for detecting fraudulent transactions (Dorronsoro et al., 1997)), in finance (to predict stock prices (Huang et al., 2005a)), in marketing (to reveal patterns of consumer spending (Bose and Mahapatra, 2001)), and on the Internet (as part of search engines (Basili, 2003)). In biomedicine, MYCIN was proposed in the early 1970s at Stanford University. ...
Article
In this paper, we give a short introduction to machine learning and survey its applications in radiology. We focus on six categories of applications in radiology: medical image segmentation, registration, computer-aided detection and diagnosis, brain function or activity analysis and neurological disease diagnosis from fMR images, content-based image retrieval systems for CT or MRI images, and text analysis of radiology reports using natural language processing (NLP) and natural language understanding (NLU). This survey shows that machine learning plays a key role in many radiology applications. Machine learning identifies complex patterns automatically and helps radiologists make intelligent decisions on radiology data such as conventional radiographs, CT, MRI, and PET images and radiology reports. In many applications, the performance of machine learning-based automatic detection and diagnosis systems has been shown to be comparable to that of a well-trained and experienced radiologist. Technology development in machine learning and radiology will benefit from each other in the long run. Key contributions and common characteristics of machine learning techniques in radiology are discussed. We also discuss the problem of translating machine learning applications to the radiology clinical setting, including advantages and potential barriers.
Article
The emerging world-wide e-society creates new ways of interaction between people with different cultures and backgrounds. Communication systems such as forums, blogs, and comments are easily accessible to end users. In this context, managing user-generated content has proven to be a difficult but necessary task. Studying and interpreting user-generated data/text available on the Internet is a complex and time-consuming task for any human analyst. This study proposes an interdisciplinary approach to modelling the flaming phenomenon (hot, aggressive discussions) in online Italian forums. The model is based on the analysis of psycho/cognitive/linguistic interaction modalities among web communities' participants, state-of-the-art machine learning techniques, and natural language processing technology. Virtual communities' administrators, moderators, and users could benefit directly from this research. A further positive outcome of this research is the opportunity to better understand and model the dynamics of web forums as the basis for developing opinion mining applications focused on commercial uses.
Conference Paper
Using a set of binary classifiers to solve the multiclass classification problem has been a popular approach over the years. This technique is known as binarization. The decision boundary that these binary classifiers (also called base classifiers) have to learn is much simpler than the decision boundary of a multiclass classifier. But binarization gives rise to a new problem called the class imbalance problem. The class imbalance problem occurs when the data set used for training has relatively fewer data items for one class than for another. This problem becomes more severe if the original data set itself was imbalanced. Furthermore, binarization has so far only been implemented in the domain of supervised classification. In this paper, we propose a framework called Binarization with Boosting and Oversampling (BBO). Our framework can handle the class imbalance problem arising from binarization. As the name suggests, this is achieved through a combination of boosting and oversampling. The BBO framework can be used with any supervised classification algorithm. Moreover, unlike earlier binarization approaches, we apply our framework to semi-supervised classification as well. The BBO framework has been rigorously tested with a number of benchmark data sets from the UCI machine learning repository. The experimental results show that using the BBO framework achieves higher accuracy than the traditional binarization approach.
Article
Using a set of binary classifiers to solve multiclass classification problems has been a popular approach over the years. The decision boundaries learnt by binary classifiers (also called base classifiers) are much simpler than those learnt by multiclass classifiers. This paper proposes a new classification framework, termed binarization with boosting and oversampling (BBO), for efficiently solving multiclass classification problems. The new framework is devised based on the one-versus-all (OVA) binarization technique. Unlike most previous work, BBO employs boosting for solving the hard-to-learn instances and oversampling for handling the class-imbalance problem arising from OVA binarization. These two features distinguish BBO from existing work. Our new framework has been tested extensively on several multiclass supervised and semi-supervised classification problems using several different base classifiers, including neural networks, C4.5, k-nearest neighbor, repeated incremental pruning to produce error reduction (RIPPER), support vector machines, random forest, and learning with local and global consistency. Experimental results show that BBO can exhibit better performance than its counterparts on supervised and semi-supervised classification problems.
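The imbalance that OVA binarization creates is easy to see: each class supplies the positives while all remaining classes pool into the negatives. A minimal sketch (toy data; boosting and the base learner are abstracted away) of the OVA split plus random oversampling of the positive side:

```python
import random

def ova_with_oversampling(X, y, seed=0):
    """Build one balanced binary training set per class (one-versus-all).

    For each class, positives are that class's examples and negatives are
    everything else; the positive side is randomly oversampled (duplicated)
    until the two sides are the same size.
    """
    rng = random.Random(seed)
    binary_sets = {}
    for cls in sorted(set(y)):
        pos = [x for x, label in zip(X, y) if label == cls]
        neg = [x for x, label in zip(X, y) if label != cls]
        while len(pos) < len(neg):
            pos.append(rng.choice(pos))
        binary_sets[cls] = (pos, neg)
    return binary_sets

# Three classes, two examples each: every OVA split starts 2-vs-4.
X = [(0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (9.0, 0.1), (9.1, 0.0)]
y = ["a", "a", "b", "b", "c", "c"]
sets = ova_with_oversampling(X, y)
```

In the full BBO framework the base classifier is then boosted on each balanced set; here only the data-preparation step is shown.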
Article
In state-of-the-art region of interest (ROI) based image retrieval systems, a user-defined ROI query is considered to reflect the user's intention more effectively than an ROI query automatically selected by the system. Compared with existing image retrieval methods, user-defined ROI based image retrieval has two obvious characteristics: first, the target region is located at the center of the ROI query, and second, the ROI query contains hardly any noisy descriptors that do not belong to the target region. Based on these two characteristics and the general bag-of-words image retrieval method, an auxiliary Gaussian weighting (AGW) scheme is incorporated into our ROI based image retrieval system. Each descriptor is weighted according to its distance from the center of the ROI query, using a 2-d Gaussian window function. The AGW scheme is used to compute the score of each image in the database. Meanwhile, an efficient re-ranking algorithm is proposed based on the distribution consistency of the Gaussian weight between the matched descriptors of the ROI query and the candidate image, referred to as DCGW re-ranking. The experimental results demonstrate that our system obtains satisfactory retrieval results.
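The weighting step itself is compact. A minimal sketch (hypothetical descriptor coordinates and sigma, not the paper's parameters) of a 2-d Gaussian window centered on the ROI, so descriptors near the center (the likely target region) count more than those near the border:

```python
import math

def agw_weight(pt, center, sigma):
    """2-d Gaussian window weight for a descriptor located at pt."""
    dx, dy = pt[0] - center[0], pt[1] - center[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

roi_center = (50.0, 50.0)
# Toy descriptor positions: at the center, mid-distance, and near a corner.
descriptors = [(50.0, 50.0), (60.0, 50.0), (90.0, 90.0)]
weights = [agw_weight(p, roi_center, sigma=25.0) for p in descriptors]
```

In the full system these weights feed into the bag-of-words scoring of every database image, and their distribution over matched descriptors drives the DCGW re-ranking.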
Conference Paper
Full-text available
We propose a method for estimating user activities by analyzing long-term (more than several seconds) acoustic signals represented as acoustic event temporal sequences. The proposed method is based on a probabilistic generative model of an acoustic event temporal sequence that is associated with user activities (e.g. "cooking") and subordinate categories of user activities (e.g. "fry ingredients" or "plate food") in which each user activity is represented as a probability distribution over unsupervised subordinate categories of user activities called activity-topics, and each activity-topic is represented as a probability distribution over acoustic events. This probabilistic generative model can express user activities that have more than one subordinate category of the user activities, which a model that takes into account only user activities cannot express adequately. User activity estimation with this model is achieved using a two-step process: frame-by-frame acoustic event estimation to output an acoustic event temporal sequence and user activity estimation with the proposed probabilistic generative model. Activity estimation experiments with real-life sounds indicated that the proposed method improved user activity estimation accuracy and stability of "unseen" acoustic event temporal sequences. In addition, the experiment showed that the proposed method could extract correct subordinate categories of user activities.
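The generative story can be sketched directly: an activity is a distribution over latent activity-topics, and each topic is a distribution over acoustic events, so an observed event sequence is drawn topic-by-topic. A minimal sketch with made-up distributions (inference over the latent topics, the hard part, is omitted):

```python
import random

def sample_event_sequence(activity, topics, length, seed=0):
    """Draw an acoustic event sequence from the two-level generative model.

    activity: {topic: prob} -- the activity's distribution over activity-topics.
    topics:   {topic: {event: prob}} -- each topic's distribution over events.
    """
    rng = random.Random(seed)
    sequence = []
    for _ in range(length):
        topic = rng.choices(list(activity), weights=list(activity.values()))[0]
        events = topics[topic]
        sequence.append(rng.choices(list(events), weights=list(events.values()))[0])
    return sequence

# Hypothetical "cooking" activity with two activity-topics.
cooking = {"fry": 0.6, "plate": 0.4}
topics = {
    "fry":   {"sizzle": 0.7, "clank": 0.3},
    "plate": {"clank": 0.5, "scrape": 0.5},
}
seq = sample_event_sequence(cooking, topics, length=10)
```

Because topics are shared structure, the same model can describe activities with several subordinate categories, which a flat activity-to-event model cannot.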
Conference Paper
We propose a model for analyzing acoustic scenes by using long-term (more than several seconds) acoustic signals based on a probabilistic generative model of an acoustic feature sequence associated with acoustic scenes (e.g. "cooking") and acoustic events (e.g. "cutting with a knife," "heating a skillet" or "running water") called the latent acoustic topic and event allocation (LATEA) model. The proposed model allows the analysis of a wide variety of sounds and the capture of abstract acoustic scenes by representing acoustic events and scenes as latent variables, and can also describe the acoustic similarity and variance between acoustic events by representing acoustic features as a mixture of Gaussian components. Experiments with real-life sounds indicated that the proposed model exhibited lower perplexity than conventional models; it improved the stability of acoustic scene estimation. The experimental results also suggested that the proposed model can better describe the acoustic similarity and variance between acoustic events than conventional models.
Conference Paper
Full-text available
In this paper, we describe a multi-class text categorization approach to authorship attribution and test it on sets of e-mail collections. The PAN 2011 competition data consists of e-mails of variable length, written by various candidate authors, with some authors represented by significantly longer or more numerous e-mails than others. Rather than constructing a classifier for each separate author to discriminate it from the others (i.e. binary classification), we adopt a multi-class scheme in which all authorship classes are learned simultaneously. We explore the effect of the selection of feature types and of the C parameter in the SVMmulticlass learning algorithm. Variable-length lexical features showed promising results; nevertheless, our authorship attribution approach scored only a mid position among the other competitors, for the SMALL as well as the LARGE test sets.
Conference Paper
Full-text available
This note describes a scoring scheme for the coreference task in MUC6. It improves on the original approach by: (1) grounding the scoring scheme in terms of a model; (2) producing more intuitive recall and precision scores; and (3) not requiring explicit computation of the transitive closure of coreference. The principal conceptual difference is that we have moved from a syntactic scoring model based on following coreference links to an approach defined by the model theory of those links.
Conference Paper
Full-text available
In this paper, we apply Support Vector Machines (SVMs) to identify English base phrases (chunks). It is well-known that SVMs achieve high generalization performance even using input data with a high dimensional feature space. Furthermore, by introducing the Kernel principle, SVMs can carry out training with smaller computational cost independent of the dimensionality of the feature space. In order to improve accuracy, we also apply majority voting with 8 SVMs which are trained using distinct chunk representations. Experimental results show that our approach achieves better accuracy than other conventional frameworks.
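The voting step combines the 8 chunkers' per-token predictions by simple majority. A minimal sketch (the votes below are hypothetical IOB chunk tags, not output from trained SVMs):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common prediction; ties fall to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-token votes from 8 chunkers trained on distinct
# chunk representations (IOB-style tags).
votes = ["B-NP", "B-NP", "I-NP", "B-NP", "B-NP", "O", "B-NP", "I-NP"]
tag = majority_vote(votes)
```

Because the 8 chunkers are trained on different chunk representations, their errors tend to differ, which is what makes the vote more accurate than any single model.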
Article
An abstract is not available.
Article
The technology for building knowledge-based systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
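At the heart of the ID3 synthesis described above is the information-gain criterion: split on the attribute whose partition of the examples most reduces class entropy. A minimal sketch on a made-up toy data set:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Entropy reduction from partitioning the examples on attribute attr."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(ex[attr] for ex in examples):
        subset = [l for ex, l in zip(examples, labels) if ex[attr] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy data: 'outlook' separates the classes perfectly, 'windy' not at all.
examples = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rain",  "windy": False},
    {"outlook": "rain",  "windy": True},
]
labels = ["no", "no", "yes", "yes"]

best = max(["outlook", "windy"], key=lambda a: information_gain(examples, labels, a))
```

ID3 applies this choice recursively, growing the decision tree until each leaf is (near-)pure; the noise-handling refinements the paper discusses modify the stopping and pruning rules, not the gain criterion itself.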
Article
Two approaches to the problem of resolving pronoun references are presented. The first is a naive algorithm that works by traversing the surface parse trees of the sentences of the text in a particular order, looking for noun phrases of the correct gender and number. The algorithm clearly does not work in all cases, but the results of an examination of several hundred examples from published texts show that it performs remarkably well. In the second approach, it is shown how pronoun resolution can be handled in a comprehensive system for semantic analysis of English texts. The system is described, and a detailed treatment of several examples shows how semantic analysis locates the antecedents of most pronouns as a by-product. Included are the classic examples of Winograd and Charniak.
Conference Paper
Expert Network (ExpNet) is our approach to automatic categorization and retrieval of natural language texts. We use a training set of texts with expert-assigned categories to construct a network which approximately reflects the conditional probabilities of categories given a text. The input nodes of the network are words in the training texts, the nodes on the intermediate level are the training texts, and the output nodes are categories. The links between nodes are computed based on statistics of the word distribution and the category distribution over the training set. ExpNet is used for relevance ranking of candidate categories of an arbitrary text in the case of text categorization, and for relevance ranking of documents via categories in the case of text retrieval. We have evaluated ExpNet in categorization and retrieval on a document collection from the MEDLINE database, and observed recall and precision comparable to the Linear Least Squares Fit (LLSF) mapping method, and significantly better than the other methods tested. Computationally, ExpNet has an O(N log N) time complexity, much more efficient than the cubic complexity of the LLSF method. The simplicity of the model, the high recall and precision rates, and the efficient computation together make ExpNet preferable as a practical solution for real-world applications.
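The three-layer idea can be sketched very simply: input words link to the training texts that contain them, and training texts link to their expert-assigned categories, so a new text ranks categories through the training texts it shares words with. A minimal sketch (toy corpus; uniform link weights stand in for ExpNet's word- and category-distribution statistics):

```python
from collections import defaultdict

# Hypothetical training texts with expert-assigned categories.
training = [
    ("heart attack symptoms", "cardiology"),
    ("heart valve surgery", "cardiology"),
    ("skin rash treatment", "dermatology"),
]

def rank_categories(text):
    """Score each category by word overlap routed through the training texts."""
    scores = defaultdict(float)
    words = set(text.split())
    for doc, category in training:
        scores[category] += len(words & set(doc.split()))
    return sorted(scores, key=scores.get, reverse=True)

ranking = rank_categories("heart surgery recovery")
```

The real network weights each word-to-text and text-to-category link statistically rather than uniformly, which is where its approximation of P(category | text) comes from.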
Conference Paper
In this paper, a Bayesian inference network model for automatic indexing with index terms (descriptors) from a prescribed vocabulary is presented. It requires an indexing dictionary with rules mapping terms of the respective subject field onto descriptors and inverted lists for terms occurring in a set of documents of the subject field and descriptors manually assigned to these documents. The indexing dictionary can be derived automatically from a set of manually indexed documents. An application of the network model is described, followed by an indexing example and some experimental results about the indexing performance of the network model.
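The dictionary-driven core of such an indexer can be sketched without the full inference network: each term contributes weighted evidence for the descriptors its dictionary rules map it to, and descriptors are ranked by accumulated evidence. A minimal sketch (the dictionary entries and weights below are hypothetical):

```python
from collections import defaultdict

# Hypothetical indexing dictionary: term -> {descriptor: rule weight},
# as would be derived from a set of manually indexed documents.
rules = {
    "neural":  {"Neural Networks": 0.9},
    "network": {"Neural Networks": 0.6, "Computer Networks": 0.4},
    "router":  {"Computer Networks": 0.9},
}

def rank_descriptors(text):
    """Rank controlled-vocabulary descriptors by summed rule weights."""
    scores = defaultdict(float)
    for term in text.split():
        for descriptor, weight in rules.get(term, {}).items():
            scores[descriptor] += weight
    return sorted(scores, key=scores.get, reverse=True)

ranked = rank_descriptors("neural network training")
```

The paper's network model replaces this additive scoring with proper Bayesian inference over the term-descriptor links, but the flow of evidence from document terms to candidate descriptors is the same.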