Conference Paper

Categorical Proportional Difference: A Feature Selection Method for Text Categorization.

Authors: Simeon and Hilderman

Abstract

Supervised text categorization is a machine learning task where a predefined category label is automatically assigned to a previously unlabelled document based upon characteristics of the words contained in the document. Since the number of unique words in a learning task (i.e., the number of features) can be very large, the efficiency and accuracy of the learning task can be increased by using feature selection methods to extract from a document a subset of the features that are considered most relevant. In this paper, we introduce a new feature selection method called categorical proportional difference (CPD), a measure of the degree to which a word contributes to differentiating a particular category from other categories. The CPD for a word in a particular category in a text corpus is a ratio that considers the number of documents of a category in which the word occurs and the number of documents from other categories in which the word also occurs. We conducted a series of experiments to evaluate CPD when used in conjunction with SVM and Naive Bayes text classifiers on the OHSUMED, 20 Newsgroups, and Reuters-21578 text corpora. Recall, precision, and the F-measure were used as the measures of performance. The results obtained using CPD were compared to those obtained using six common feature selection methods found in the literature: χ2, information gain, document frequency, mutual information, odds ratio, and simplified χ2. Empirical results showed that, in general, according to the F-measure, CPD outperforms the other feature selection methods in four out of six text categorization tasks.
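As a rough illustration of the document-frequency ratio described in the abstract, the following Python sketch scores each (word, category) pair. The function and variable names, and the exact form of the ratio used here ((A − B)/(A + B)), are illustrative assumptions and may differ in detail from the paper's definition.

```python
from collections import defaultdict

def cpd_scores(docs, labels):
    """CPD-style score per (word, category) from document frequencies.

    docs   : list of token lists (one per document)
    labels : list of category labels, aligned with docs
    Score  : (A - B) / (A + B), where A is the number of documents of the
             category containing the word and B is the number of documents
             of all other categories containing the word.
    """
    df = defaultdict(lambda: defaultdict(int))   # word -> category -> doc count
    total_df = defaultdict(int)                  # word -> doc count over all categories
    for tokens, label in zip(docs, labels):
        for word in set(tokens):                 # count each word once per document
            df[word][label] += 1
            total_df[word] += 1

    scores = {}
    for word, per_cat in df.items():
        for cat, a in per_cat.items():
            b = total_df[word] - a               # occurrences in other categories
            scores[(word, cat)] = (a - b) / (a + b)
    return scores

# toy example: "ball" appears only in sports documents, so its score for sports is 1.0
docs = [["ball", "game"], ["ball", "team"], ["election", "vote"]]
labels = ["sports", "sports", "politics"]
print(cpd_scores(docs, labels)[("ball", "sports")])   # 1.0
```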


... We follow [MRS08, footnote on page 272] and name it Pointwise Mutual Information. The name Mutual Information is later also used for this method in [SH08] and [Seb02]. We have not tested this metric, as it is known for its poor results. ...
... Example: The e-mail example is ranked by weighted average BNS values; the rest of the values are calculated in the same way and are presented in [SH08]. ...
... Categorical Proportional Difference is reported in [SH08] to have excellent performance for both Naïve Bayes and Support Vector Machines, but at the cost of low aggressivity levels. Their study used an exhaustive search to find the percentage of features at which each feature selection method performed best. ...
... In the literature, different approaches have been taken to select features, and different algorithms have then been used to train models for sentiment analysis of tweets. However, Categorical Proportional Difference (CPD) and Chi-square based feature extraction for sentiment analysis are rarely found, although they are good candidates for supervised text classification [11,12]. The main objective of this study is to find the best SVM kernel function with an optimal feature selection strategy. ...
... i) Categorical Proportional Difference (CPD): This is a feature selection technique for text categorization. CPD measures the degree to which a word contributes to differentiating a particular category from other categories [11]. In this method, a CPD value is assigned to every n-gram, and the n-grams with the highest scores are selected, as shown in equation (1). ...
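To make the selection step described above concrete, here is a hypothetical scikit-learn sketch that ranks vocabulary terms by their best CPD score across classes and keeps a top fraction before training a classifier. The keep fraction, the toy data, and the choice of LinearSVC are illustrative assumptions, not the cited study's setup.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def select_top_cpd_terms(texts, labels, keep=0.2):
    """Rank vocabulary terms by their best CPD score over all classes and
    keep the top `keep` fraction (a tunable parameter)."""
    vec = CountVectorizer(binary=True)              # binary document occurrence
    X = vec.fit_transform(texts)
    y = np.asarray(labels)
    vocab = np.array(vec.get_feature_names_out())

    total_df = np.asarray(X.sum(axis=0)).ravel()    # docs containing each term
    best = np.full(len(vocab), -1.0)
    for c in np.unique(y):
        a = np.asarray(X[y == c].sum(axis=0)).ravel()   # doc frequency within class c
        b = total_df - a                                # doc frequency in other classes
        best = np.maximum(best, (a - b) / np.maximum(a + b, 1))
    top = np.argsort(best)[::-1][: max(1, int(keep * len(vocab)))]
    return set(vocab[top])

texts = ["great plot and acting", "terrible plot, boring acting",
         "great film", "boring and terrible film"]
labels = ["pos", "neg", "pos", "neg"]
terms = select_top_cpd_terms(texts, labels, keep=0.5)
clf_vec = CountVectorizer(vocabulary=sorted(terms))
clf = LinearSVC().fit(clf_vec.fit_transform(texts), labels)
print(clf.predict(clf_vec.transform(["great acting"])))   # expected: ['pos']
```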
Article
Full-text available
Social media has become incredibly popular these days for communicating with friends and for sharing opinions. According to current statistics, almost 2.22 billion people used social media in 2016, roughly one third of the world population and three times the entire population of Europe. In social media, people share their likes, dislikes, opinions, interests, etc., so it is possible to learn a person's thoughts about a specific topic from the data they share. Since Twitter is one of the most popular social media platforms in the world, it is a very good source for opinion mining and sentiment analysis on different topics. In this research, SVM with different kernel functions and AdaBoost are evaluated using CPD and Chi-square feature extraction techniques to find the best sentiment classification model. The reported average accuracies of AdaBoost for Chi-square and CPD are 70.2% and 66.9%, respectively. The SVM radial basis kernel and polynomial kernel with Chi-square n-grams reported average accuracies of 73.73% and 68.67%, respectively. Among the experiments performed, the SVM sigmoid kernel with Chi-square n-grams provided the maximum accuracy of 74.4%.
... Other approaches calculate a score for each individual feature and then select a predefined amount of features based on the rank of the scores, such as the Chi-square statistic (CHI), information gain (IG) and so on (Keshtkar and Inkpen 2009; O'Keefe and Koprinska 2009; Simeon and Hilderman 2008; Tan and Zhang 2008; Ye et al. 2009). From Table 1, we can see that these kinds of methods are effective in some experiments. ...
... CPD (Simeon and Hilderman 2008) is another easy term selection method for multi-class classification problems. O'Keefe and Koprinska (2009) employed CPD on binary sentiment classification. ...
Article
Full-text available
Text-based social media has become one of the most important communication tools between customers and enterprises. In social media, users can easily express their opinions and evaluations of products or services. These online user experiences, especially negative evaluations, indeed affect other consumers' behavior. Consequently, effectively identifying customers' sentiments and preventing negative comments from causing great damage to enterprises has become a critical issue. In recent years, machine learning algorithms have been viewed as effective solutions for sentiment classification. However, as the number of online reviews grows, the dimensionality of the text data rises remarkably, and the performance of machine learning methods degrades due to this dimensionality problem. Moreover, conventional feature selection methods tend to select attributes from the majority sentiment, which usually cannot improve classification performance. Therefore, this study presents two feature selection methods: a modified categorical proportional difference (MCPD) approach that improves the conventional CPD method, and a balance category feature (BCF) strategy that selects attributes equally from both positive and negative examples, to improve sentiment classification performance. Finally, several real-world cases of sentiment analysis on text reviews are provided to demonstrate the effectiveness of our proposed methods. The results show that the combination of the proposed BCF strategy and the MCPD method can not only remarkably reduce the feature space but also improve sentiment classification performance.
... where A is the frequency with which f and c occur together, B is the frequency with which f occurs without c, C is the frequency with which c occurs without f, D is the frequency with which neither c nor f occurs, and lastly N is the total number of documents. Simeon and Hilderman proposed the Categorical Proportional Difference (CPD) method to measure the usefulness of a term in differentiating between categories [9]. The measurement is based on the ratio of the frequencies of a word occurring across different categories of documents. ...
... When the CPD score approaches its maximum of 1, it shows that the feature occurs mostly in documents of a particular category only and is helpful in distinguishing between categories. CPD is quite a recent method and has been used in only a few experiments [9,10]. The benefit of CPD is that it can eliminate common terms that have high document frequency but are unimportant, such as stop words, based on their roughly equal occurrence across all classes of documents [11]. ...
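Using the contingency counts A, B, C and D defined in the excerpt above, a standard chi-square term-class score and the CPD ratio can be computed as in the sketch below. This is a generic illustration under those definitions, not code from the cited works.

```python
def chi_square(A, B, C, D):
    """Chi-square statistic for a term-class pair from contingency counts:
    A = docs containing the term and belonging to the class
    B = docs containing the term but not in the class
    C = docs in the class without the term
    D = docs with neither the term nor the class
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom

def cpd(A, B):
    """CPD for a term-class pair: approaches 1 when the term occurs almost
    only in the class, and 0 when it occurs equally in all classes."""
    return (A - B) / (A + B)

# a term that appears in 40 of 50 documents of the class and in 10 of 150 other documents
print(round(chi_square(A=40, B=10, C=10, D=140), 2))   # ~107.56
print(cpd(A=40, B=10))    # 0.6, discriminative
print(cpd(A=75, B=75))    # 0.0, stop-word-like term occurring equally everywhere
```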
Chapter
Full-text available
Document sentiment analysis is the task of determining whether a document has a positive, negative or neutral sentiment. It is made up of subtasks including feature extraction, feature selection and sentiment classification. Feature selection is the task of selecting relevant features that can help the classifier produce better results. This paper focuses on comparing classification performance across several feature selection methods used to select relevant features and to reduce the document-term matrix representation of the documents. The purpose of applying feature selection, besides selecting relevant features, is also to reduce the number of features and preserve the efficiency of the whole system. In this work, the experiment is designed to investigate the effectiveness of several selected feature selection methods in improving sentiment analysis results. Based on the findings, although common feature selection methods such as Document Frequency (DF), Information Gain (IG) and Chi-Squared Statistics (CSS) are able to produce high sentiment analysis accuracy, the Categorical Probability Proportional Difference (CPPD) method is found to be more effective, as it produces higher accuracy in classifying the documents based on their sentiments. The Categorical Proportional Difference (CPD) method produces acceptable classification results but is weak in reducing the number of features. In short, the CPPD method enables the sentiment analysis task to be conducted with a higher accuracy rate coupled with a high feature reduction rate.
... In various studies, feature selection methods are also often used to reduce dimensionality and speed up the computation process. In addition, by using feature selection we can increase the efficiency and accuracy of extracting from a document a subset of the features considered most relevant (Simeon, 2008). The study conducted by Simeon compared several feature selection methods, one of which is Categorical Proportional Difference (CPD). ...
... Based on the above, this study uses the Naïve Bayes method because of its better accuracy, and feature selection is performed before classification using the Categorical Proportional Difference (CPD) method to measure the degree to which a word contributes to deciding whether it deserves to be prioritized for classification. CPD is used because it can find words that occur frequently in a document class, using the positive document frequency and the negative document frequency (Simeon, 2008). This research is expected to address the problem of analyzing and evaluating users' views of a product, so that product weaknesses can be identified from the users' point of view and the usability and sales of the product can be improved. ...
Article
Full-text available
Beauty products have become popular in various circles, especially among women. Most women own beauty products, and they are considered a primary requirement for improving their appearance. A product cannot be separated from the comments or reviews consumers write about it. These reviews help consumers be more selective in choosing a product, and they help producers measure the quality of the products they make. However, producers sometimes have difficulty sorting and categorizing reviews, for example whether a product is of good, fairly good, or poor quality. In this study, the assessment of a product based on a given review is its rating, so a rating prediction system is needed to predict and determine the appropriate rating based on the reviews users give a product. To support the system, this study uses the Naïve Bayes and Categorical Proportional Difference methods: Naïve Bayes is the classification method, while Categorical Proportional Difference is the feature selection method used to further optimize the classification results. From the test results, the best accuracy of 87% was obtained when 50% of the features were used, which was better than the results with the other feature usage ratios of 25%, 75% and 100%. These results show that CPD can select the words that are considered relevant or irrelevant for classification.
Chapter
Full-text available
Sentiment analysis is the task of classifying documents according to their sentiment polarity. Before classifying sentiment documents, plain text documents need to be transformed into workable data for the system. This step is known as feature extraction. Feature extraction produces text representations that are enriched with information in order to achieve better classification results. The experiment in this work aims to investigate the effects of applying different sets of extracted features and to discuss the behavior of these features in sentiment analysis. The feature extraction methods include unigrams, bigrams, trigrams, Part-Of-Speech (POS) and SentiWordNet methods. The unigram, part-of-speech and SentiWordNet features are word-based features, whereas bigrams and trigrams are phrase-based features. From the results obtained, phrase-based features are more effective for sentiment analysis, as the accuracies produced are much higher than those of word-based features. This might be because word-based features disregard the sentence structure and word order of the original text, thus distorting its original meaning. Bigram and trigram features retain some of the sentence order, thus contributing to better representations of the text.
... In this paper, we aim to construct a more accurate common subspace and then train a classifier for the target domain based on the common subspace. Given a labeled source domain S and an unlabeled target domain T, we can compute the SO (sentiment orientation) of features in the source domain using methods such as CPD (Categorical Proportional Difference) [9] or OR (odds ratio) [10]. Then we predict the sentiment orientation of these features in the target domain based on co-occurrence relationships, and the common subspace is constructed according to the consistency of sentiment orientation between the domains. ...
... In this subsection we select the domain-independent features used to construct the common subspace according to their sentiment orientation in both domains. Normally, we can use the difference between a feature's frequency in positive text and in negative text to represent its sentiment orientation, as in CPD (Categorical Proportional Difference) [9] and OR (odds ratio) [10]. In this paper we use CPD to calculate the sentiment orientation of features; its formula is shown in Eq. (1). ...
... However, in many cases, the performance of the classification task is still poor. Therefore, in this study, a GA-based metaheuristic optimization algorithm is used for effective feature selection, following the advanced binary metaheuristic models proposed in [13][14][15][16]. ...
Article
Full-text available
Multilabel classification of Arabic text is an important task for understanding and analyzing social media content. It can enable the categorization and monitoring of social media posts, the detection of important events, the identification of trending topics, and the gaining of insights into public opinion and sentiment. However, multilabel classification of Arabic contents can present a certain challenge due to the high dimensionality of the representation and the unique characteristics of the Arabic language. In this paper, an effective approach is proposed for Arabic multilabel classification using a metaheuristic Genetic Algorithm (GA) and ensemble learning. The approach explores the effect of Arabic text representation on classification performance using both Bag of Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF) methods. Moreover, it compares the performance of ensemble learning methods such as the Extra Trees Classifier (ETC) and Random Forest Classifier (RFC) against a Logistic Regression Classifier (LRC) as a single and ensemble classifier. We evaluate the approach on a new public dataset, namely, the MAWQIF dataset. The MAWQIF is the first multilabel Arabic dataset for target-specific stance detection. The experimental results demonstrate that the proposed approach outperforms the related work on the same dataset, achieving 80.88% for sentiment classification and 68.76% for multilabel tasks in terms of the F1-score metric. In addition, the data augmentation with feature selection improves the F1-score result of the ETC from 65.62% to 68.80%. The study shows the ability of the GA-based feature selection with ensemble learning to improve the classification of multilabel Arabic text.
... The required and restricted data analysis tools analyze various financial products and determine the recommended results. The literature introduces the reform of the financial system and the gradual completion of the regulatory system, which put forward new and higher requirements for the management and development of the banking industry [19]. In order to meet customers' needs in the telecommunications environment, more and more financial products have been launched, and banks have paid more attention to customer service and marketing while improving their management level [20]. ...
Preprint
Full-text available
Collaborative filtering recommendation is a technology that has appeared rapidly in information filtering and information systems in recent years. At present, it is widely used in commercial activities and has achieved very satisfactory results. The research in this article is based on the basic operating system method and the suggestion of the automatic recognition system (collaborative filtering recommendation), that is, customers purchasing fixed deposits. Based on behavioral theory and new institutional arrangements, this article explores the influence and effect of the external development of digital banks on the digital behavior of traditional commercial banks, and concludes that the development of digital banks has a positive impact on bank operations and product differentiation innovation. The economic pressure brought about by the development of digital banks first promoted banks' product innovation, while the social pressure mechanism affected banks' digital innovation in management and production. Social pressure has an impact on bank management and digital innovation. The improvement of financial business transparency and the diversification of financial products have also increased competition in the financial market. Accurately predicting customer preferences is crucial for financial business companies. The development of an effective classification model will not only help increase company profits, but also effectively reduce costs. In the user-based collaborative filtering algorithm, by establishing a time-series-based consumer network, we determine the targeted influence relationships between users to find the neighbor set more accurately, and establish a time-series-based collaborative filtering algorithm to improve the accuracy of the recommendation algorithm.
... CPD measures the degree to which a term contributes to differentiating between the categories in the corpus [47]. Allotey [48] uses the CPD value for sentiment analysis and classification of online reviews. ...
Article
Full-text available
Authorship attribution has been largely investigated based on writing style analysis to identify the author of a given document. This paper describes a supervised approach to identify the lyricist for a given Tamil film lyric document for the first time. In addition to statistical features, linguistic, poetic and semantic features have been used to identify the lyricists. The accuracy of the system was improved by incorporating different classification models with different feature selection methods. The evaluation was carried out using 15,286 lyric documents for a set of 113 lyricists and the performance of the system was determined based on precision, recall and F-measure. The experimental results suggest that the support vector machine (SVM) method can be used to achieve better accuracy compared to the other methods investigated.
... ▪ Categorical Proportional Difference (CPD): This method was introduced by the authors in (Simeon & Hilderman, 2008) to quantify the influence of each feature in indicating a particular class. The frequency of each feature is calculated separately. ...
Article
Full-text available
With the pervasive growth of web-based businesses, sentiment analysis of online reviews has attracted increasing interest among text mining experts. The problem is complicated when these reviews are in the Persian language, since all existing works are focused on the English language, leaving other languages to multilingual models with limited resources. Due to these drawbacks, we aim to give insight into the different stages of Persian Sentiment Analysis. This study presents a taxonomy of all Persian Sentiment Analysis works, considering the most common techniques. Four steps are considered, namely pre-processing, feature engineering, lexicon generation, and classification. As a result, we reveal that newer works focus on deep learning methods. We also suggest that applying other methods, such as heuristic and hybrid approaches, would be worthwhile for classification performance in Persian Sentiment Analysis. Finally, we summarize the most important issues in this domain, including the lack of datasets, lexicons, tools, etc.
... We investigate whether there is a relation between feature selection algorithms commonly used for text classification and the most attended words in the fine-tuned language models. We center our analysis on four feature selection methods used for text classification [14,18,22]: Chi-square (chi), Information Gain (ig), Document Frequency (df), and Categorical Proportional Difference (pd). Chi-square measures the lack of independence between a word and a class; its value is zero if the word and the class are independent. ...
Chapter
Full-text available
We investigate the self-attention mechanism of BERT in a fine-tuning scenario for the classification of scientific articles over a taxonomy of research disciplines. We observe how self-attention focuses on words that are highly related to the domain of the article. Particularly, a small subset of vocabulary words tends to receive most of the attention. We compare and evaluate the subset of the most attended words with feature selection methods normally used for text classification in order to characterize self-attention as a possible feature selection approach. Using ConceptNet as ground truth, we also find that attended words are more related to the research fields of the articles. However, conventional feature selection methods are still a better option to learn classifiers from scratch. This result suggests that, while self-attention identifies domain-relevant terms, the discriminatory information in BERT is encoded in the contextualized outputs and the classification layer. It also raises the question whether injecting feature selection methods in the self-attention mechanism could further optimize single sequence classification using transformers.
... The CPPD includes two separate approaches: the Categorical Proportional Difference (CPD) and the Probability Proportional Difference (PPD) techniques. The CPD approach estimates the degree to which a word contributes to discriminating the class, and only the words with the highest contribution are included in the classification [17]. The PPD methodology measures the degree to which a word belongs to a specified class, and the variance is a measure of the ability to differentiate between them. ...
Article
Full-text available
Undoubtedly, huge volumes of business data can make data analysis more complicated, such that the decision-making process becomes out of reach. This condition happens in the field of consumer buying behavior. A well-known method called sentiment analysis can help in extracting information about current trends and can increase the market value of a product by improving its quality. One of the approaches to sentiment analysis is the feature selection technique. However, this technique has a combinatorial nature, and the analysis of huge data can involve uncertain parameters. This paper describes a framework for solving sentiment analysis based on a feature selection approach using stochastic combinatorial programming.
... Following the seminal STUCCO algorithm, Bay proposed an initial data discretization method for CSM [1]. Simeon and Hilderman proposed a slightly modified equal width binning interval to discretize continuous variables [9]. ...
Preprint
Facebook operates a family of services used by over two billion people daily on a huge variety of mobile devices. Many devices are configured to upload crash reports should the app crash for any reason. Engineers monitor and triage millions of crash reports logged each day to check for bugs, regressions, and any other quality problems. Debugging groups of crashes is a manually intensive process that requires deep domain expertise and close inspection of traces and code, often under time constraints. We use contrast set mining, a form of discriminative pattern mining, to learn what distinguishes one group of crashes from another. Prior works focus on discretization to apply contrast mining to continuous data. We propose the first direct application of contrast learning to continuous data, without the need for discretization. We also define a weighted anomaly score that unifies continuous and categorical contrast sets while mitigating bias, as well as uncertainty measures that communicate confidence to developers. We demonstrate the value of our novel statistical improvements by applying it on a challenging dataset from Facebook production logs, where we achieve 40x speedup over baseline approaches using discretization.
... The CPPD-based feature selection model is a combination of Probability Proportional Difference (PPD) and Categorical Proportional Difference (CPD). The work in [22] presents the CPD model, which calculates the degree to which a word contributes to discerning the class; the top contributing words are chosen for the classification. The PPD method, on the other hand, calculates the probability that a word belongs to a specific class, and the words or terms with a greater degree of belongingness are taken into consideration for classification. ...
... CPD [22] calculates the degree to which a feature distinguishes a specific category from other categories. The attainable values for CPD are limited to the interval (-1, 1). ...
Article
Full-text available
The dimensionality of the feature space exhibits a significant effect on the processing time and predictive performance of Malware Detection Systems (MDS). Therefore, the selection of relevant features is crucial for the classification process. Feature Selection Technique (FST) is a prominent solution that effectively reduces the dimensionality of the feature space by identifying and neglecting noisy or irrelevant features from the original feature space. The significant features recommended by FST uplift the malware detection rate. This paper provides a performance analysis of four chosen filter-based FSTs and their impact on the classifier decision. FSTs such as Distinguishing Feature Selector (DFS), Mutual Information (MI), Categorical Proportional Difference (CPD), and Darmstadt Indexing Approach (DIA) have been used in this work, and their efficiency has been evaluated using different datasets, various feature lengths, classifiers, and success measures. The experimental results explicitly indicate that DFS and MI offer competitive performance in terms of better detection accuracy and that the efficiency of the classifiers does not decline on either the balanced or the unbalanced datasets.
... Categorical Proportional Difference (CPD): In order to determine the impact of each feature (unigram) in representing a class, the CPD method was proposed in [104]. The frequency of each feature in each class (positive or negative) is calculated separately; polarized words, which occur predominantly in one class, have a higher PD value, while words distributed equally across both classes have a lower PD value. ...
Article
Full-text available
Natural language processing (NLP) techniques can prove relevant to a variety of specialties in the field of cognitive science, including sentiment analysis. This paper investigates the impact of NLP tools, various sentiment features, and sentiment lexicon generation approaches on sentiment polarity classification of internet reviews written in the Persian language. For this purpose, a comprehensive Persian WordNet (FerdowsNet), with high recall and proper precision (based on Princeton WordNet), was developed. Using FerdowsNet and a generated corpus of reviews, a Persian sentiment lexicon was developed using (i) mapping to SentiWordNet and (ii) a semi-supervised learning method, after which the results of both methods were compared. In addition to sentiment words, a set of various features was extracted and applied to sentiment classification. Then, by employing various well-known feature selection approaches and state-of-the-art machine learning methods, sentiment classification of Persian text reviews was carried out. The obtained results demonstrate the critical role of sentiment lexicon quality in improving the quality of sentiment classification in the Persian language.
... The CPD [8] value of a term is computed as the ratio of the difference between the number of documents of one category in which it appears and the number of documents of the other category in which it appears, to the total number of documents in which the term appears. The CPD value for a feature can be calculated using equation (1), where posD is the number of positive review documents in which the term appears and negD is the number of negative review documents in which the term appears. ...
... 11. Categorical Proportional Difference (cpd): cpd is a ratio that considers the number of documents of a category in which the feature occurs and the number of documents from other categories in which the feature also occurs (Simeon and Hilderman, 2008). 12. Multinomial Z Score (zd): zd assumes that a feature follows a binomial distribution and calculates a Z transformation for the feature in each class. It boosts features that are highly unevenly distributed among the classes, giving a high positive score to a feature in the class where it is highly frequent and a negative score in the class where it rarely appears (Hamdan et al., 2014; Savoy, 2012). ...
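One common reading of the zd description above treats a term's frequency in a class as a binomial draw whose success probability is estimated from the whole corpus. The sketch below follows that assumption and may differ in detail from the exact formulation used by Hamdan et al. (2014) and Savoy (2012).

```python
import math

def z_score(tf_in_class, n_class, tf_total, n_total):
    """Binomial Z transformation for a term in one class.

    tf_in_class : term occurrences (or documents containing the term) in the class
    n_class     : total tokens (or documents) in the class
    tf_total    : term occurrences in the whole corpus
    n_total     : total tokens in the whole corpus
    The expected count under the corpus-wide probability p is n_class * p;
    the score is positive where the term is over-represented and negative
    where it is under-represented.
    """
    p = tf_total / n_total
    expected = n_class * p
    std = math.sqrt(n_class * p * (1 - p))
    return 0.0 if std == 0 else (tf_in_class - expected) / std

# a term seen 60 times in a 1,000-token class but only 100 times in a 10,000-token corpus
print(round(z_score(60, 1_000, 100, 10_000), 2))   # ~15.89, strongly over-represented
```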
Article
Full-text available
Term weighting metrics assign weights to terms in order to discriminate the important terms from the less crucial ones. Due to this characteristic, these metrics have attracted growing attention in text classification and, recently, in sentiment analysis. Using the weights given by such metrics can lead to a more accurate document representation, which may improve the performance of the classification. While previous studies have focused on proposing or comparing different weighting metrics for two-class, document-level sentiment analysis, this study proposes to analyse the results given by each metric in order to find out the characteristics of good and bad weighting metrics. Therefore, we present an empirical study of fifteen global supervised weighting metrics with four local weighting metrics adopted from information retrieval. We also analyse the behavior of each metric by observing how it distributes the terms, and deduce some characteristics which may distinguish the good metrics from the bad ones. The evaluation has been done using Support Vector Machines on three different datasets: Twitter, restaurant and laptop reviews.
... Hence, we compile a list of these words and exclude them when calculating tweet polarities. We detect such words with the help of categorical proportional difference (CPD) (Simeon and Hilderman, 2008), which describes how much a word w contributes to distinguishing the different classes. It is calculated as CPD_w = |A − B| / (A + B), where A corresponds to the occurrences of the word w.r.t. ...
Conference Paper
Full-text available
This paper describes our approach to SemEval 2016 Task 4, "Sentiment Analysis in Twitter", where we participated in subtask A. Our system relies on AlchemyAPI and SentiWordNet to create 43 features, based on which we select a feature subset as the final representation. Active Learning then filters out noisy tweets from the provided training set, leaving a smaller set of only 900 tweets, which we use to train a Multinomial Naive Bayes classifier to predict the labels of the test set with an F1 score of 0.478.
... cpd is a ratio that considers the number of documents of a category in which the feature occurs and the number of documents from other categories in which the feature also occurs (Simeon and Hilderman, 2008). ...
Thesis
Full-text available
In this thesis, we address the problem of sentiment analysis. More specifically, we are interested in analyzing the sentiment expressed in social media texts such as tweets or customer reviews about restaurants, laptops and hotels, or scholarly book reviews written by experts. We focus on two main tasks: sentiment polarity detection, in which we aim to determine the polarity (positive, negative or neutral) of a given text, and opinion target extraction, in which we aim to extract the opinion targets towards which people tend to express their opinions (e.g. food, pizza and service are opinion targets in restaurant reviews). Our main objective is constructing state-of-the-art systems that can perform the two tasks. Therefore, for evaluation purposes, we participated in the International Workshop on Semantic Evaluation (SemEval), choosing two tasks: (1) Sentiment analysis in Twitter, in which we seek to determine the polarity of a tweet, and (2) Aspect-Based sentiment analysis, which aims to extract the opinion targets in restaurant reviews and then determine the polarity of each target. We have also applied and evaluated our methods using a French book review corpus constructed by the OpenEdition team, in which we also extract the opinion targets and their polarities. Our proposed methods are supervised for both tasks: 1. For sentiment polarity detection, we address three points: term weighting, feature extraction and the classification method. We first study several supervised term weighting metrics and analyze the behavior of term weighting metrics that could give good performance. Then, we enrich the document representation by extracting several groups of features. As the features extracted from sentiment lexicons seem to be the most influential, we propose a new metric called natural entropy to construct an automatic sentiment lexicon from a noisily labeled Twitter corpus, and we combine the features extracted from this lexicon to improve the performance. The evaluation demonstrates that this rich feature extraction process can produce a state-of-the-art system in sentiment analysis. After these experiments with term weighting and feature extraction with classic classification methods such as Support Vector Machines and Logistic Regression, we found that it is difficult to understand the decisions of those classic methods. Therefore, we propose a simple and interpretable model for estimating the polarity of text. This new model relies on a bottom-up approach, going from word polarity to text polarity detection. Our first experiments show that this new model seems promising and could outperform the classic methods. 2. For opinion target extraction, we adopt a Conditional Random Field model with a feature extraction process; most of the extracted features have proved their performance in the entity extraction problem. We applied this model to extract the opinion targets in English restaurant reviews and French scholarly book reviews.
... CPD [32,22] measures how well a term can be used to differentiate between different categories, based on the ratio of the word's occurrence frequency in the different categories of documents, as shown below, where A is the number of times the word and the category co-occur and B is the number of times the word occurs without the category. ...
Article
Full-text available
Text documents are normally represented as a feature-document matrix in sentiment analysis. Features can be single words from the text document or more complex pairs extracted by different schemes that add information in order to enrich the feature-document matrix representation. Having diverse feature types, however, creates a problem of high dimensionality due to the vast number of features and the relations they hold. Thus, feature selection helps ensure that effective and efficient sentiment analysis applications can be developed, by selecting features that are relevant and informative to help classifiers perform better and by narrowing down the feature set to reduce the processing load. This paper highlights the methods used for feature selection, namely filter, wrapper and embedded methods. Prior to feature selection, preprocessing techniques are performed to first reduce the number of features. The paper concludes by summarizing this review, outlining the challenges faced, and proposing an ensemble feature selection method for sentiment analysis data.
... CPD has shown good performance [32,24]; however, CPD takes only the occurrence of a feature into consideration, whereas other feature selection methods also consider statistics of the feature's absence. ...
Article
Full-text available
Sentiment analysis is an important task for the automated classification of positive and negative opinions by a machine. The approaches to this task can be either rule-based or machine learning, the latter being the current trend due to its automaticity and versatility. An ensemble framework based on machine learning classifiers diversifies the text data and the learning process in order to produce accurate sentiment classification. In this review paper, we highlight the various parts to consider when creating an ensemble of machine learning classifiers for sentiment analysis. From the selection of suitable text features to the training of the machine learner, they all influence the accuracy of the system. The ensemble framework improves the accuracy of the system based on the concept that stronger classifiers compensate for the performance of weak classifiers. The paper concludes with the inclusion of a diversified feature selection method in the ensemble framework to select useful features and reduce the size of the feature set.
... Other methods compute a score for each individual feature and then pick out a feature set of predefined size according to the rank of the scores, such as the Chi-square statistic, mutual information, information gain and so on [22,23,25,30,31]. Zheng et al. [26] divided these feature selection methods into one-sided (e.g. ...
Article
Full-text available
Bloggers' opinions related to commercial products/services might have a significant influence on consumers' purchasing decisions. Some negative comments could reduce consumers' purchase intentions and bring great damage to enterprises. However, the comments in blogs are often unstructured, subjective, and hard to comprehend in a short time. In some cases, the negative comments are fewer than the positive opinions, but these fewer negative comments spread very fast and are very harmful. According to a consumer reviews and research online report, 62% of online customers will change their mind about buying a product or service after reading 1-3 negative reviews. However, when dealing with such imbalanced sentiment data, researchers have not considered the class imbalance problem. A classifier induced from an imbalanced data set has high classification accuracy for the majority class, but an unacceptable error rate for the minority class. Therefore, identifying consumers' negative sentiments effectively from a large number of online comments has become a serious issue. This study aims to identify the key factors of imbalanced sentiment classification by using the Taguchi method. Then, according to the discovered key factors, we propose a new feature selection method to improve the performance of imbalanced sentiment classification. Moreover, support vector machines (SVM) are employed to construct classifiers for identifying bloggers' negative sentiments. Finally, a case study from real-world blogs is provided to illustrate the effectiveness of our proposed approach.
... It is a measurement that takes into consideration the existence of a feature in a class in its calculation. It was introduced by [6] and used by several other researchers, such as [7] and [8]. The calculation used in [7] was adopted in this study. ...
Article
Full-text available
Opinions about a particular product, service or person are communicated effectively through online media such as Facebook, MySpace and Twitter. Unfortunately, only a few researchers have researched the performance of opinion mining using online messages written in the Malay language. Opinion mining that uses a Natural Language Processing approach is difficult due to the high proportion of noisy text in online messages. On the other hand, opinion mining that uses a machine learning approach requires a good feature selection technique, since current filter-type feature selection techniques require intervention from the user to select the appropriate features. This study used a feature selection technique based on an artificial immune system to select the appropriate features for opinion mining. Experiments with 2000 online movie reviews illustrated that the technique reduced the features by 90% and improved opinion mining accuracy by up to 15% with the k-Nearest Neighbor classifier and up to 6% with the Naïve Bayes classifier.
... The text categorization problem suffers from high dimensionality and sparsity of the text data. It is very important to find the discriminating terms for each category and to reduce the size of vocabulary by removing the irrelevant terms for effective text categorization [2,3]. In text data each unique term is considered as a feature. ...
Article
Full-text available
Term selection methods in text categorization effectively reduce the size of the vocabulary to improve the quality of the classifier. Each corpus generally contains many irrelevant and noisy terms, which eventually reduce the effectiveness of text categorization. Term selection, thus, focuses on identifying the relevant terms for each category without affecting the quality of text categorization. A new supervised term selection approach has been proposed for dimensionality reduction. The method assigns a score to each term of a corpus based on its similarity with all the categories, and then all the terms are ranked accordingly. Subsequently, the significant terms of each category are selected to create the final subset of terms, irrespective of the size of the category. The performance of the proposed term selection technique is compared with that of nine other term selection methods for the categorization of several well-known text corpora using kNN and SVM classifiers. The empirical results show that the proposed method performs significantly better than the other methods in most cases across all the corpora.
... The label information in the source domain will be transferred to the target domain through the two matrices. p_i^l, p_i^SS and p_i^TS (called p in common) are used to represent the polarity of feature i. MI (Mutual Information), CPD (Categorical Proportional Difference) and OR (Odds Ratio) have been used to represent the polarities of features in some efforts [21,23,25]. Considering the range of polarity, here we use OR to represent the polarity. ...
... Categorical proportional distance [17] of a feature t in class C_k is defined as,
Article
Full-text available
Detection of metamorphic malware is a challenging problem as a result of the high diversity in internal code structure between generations. Code morphing/obfuscation, when applied, reshapes malware code without compromising its maliciousness. As a result, signature-based scanners fail to detect metamorphic malware. Prior research in the domain of metamorphic malware detection utilizes similarity matching techniques. This work focuses on the development of a statistical scanner for metamorphic virus detection by employing feature ranking methods such as Term Frequency-Inverse Document Frequency (TF-IDF), Term Frequency-Inverse Document Frequency-Class Frequency (TF-IDF-CF), Categorical Proportional Distance (CPD), Galavotti-Sebastiani-Simi Coefficient (GSS), Weight of Evidence of Text (WET), Term Significance (TS), Odds Ratio (OR), Weighted Odds Ratio (WOR), Multi-Class Odds Ratio (MOR), Comprehensive Measurement Feature Selection (CMFS) and Accuracy2 (ACC2). Malware and benign models for classification are developed by considering the top-ranked features obtained using the individual feature selection methods. The proposed statistical detector detects the Metamorphic worm (MWORM) and viruses generated using the Next Generation Virus Construction Kit (NGVCK) with 100% accuracy and precision. Further, the relevance of the feature ranking methods at varying feature lengths is determined using the McNemar test. Thus, the designed non-signature-based scanner can detect sophisticated metamorphic malware and can be used to support current antivirus products.
... Categorical Proportional Difference [12] of a feature F in class C_k is obtained as,
Conference Paper
Full-text available
Metamorphic malware modifies the code of every new offspring by using code obfuscation techniques. Recent research has shown that metamorphic writers make use of benign dead code to thwart signature and Hidden Markov based detectors. Failure in the detection is due to the fact that the malware code appears statistically similar to benign programs. In order to detect complex malware generated with the hacker-generated tool NGVCK, known to the research community, and the intricate metamorphic worm available as benchmark data, we propose a novel approach using Linear Discriminant Analysis (LDA) to rank and synthesize the most prominent opcode bi-gram features for identifying unseen malware and benign samples. Our investigation resulted in 99.7% accuracy, which reveals that the current method could be employed to improve the detection rate of existing publicly available malware scanners.
... In terms of opinion mining, selecting features that are relevant to positive and negative sentiments is also important. Therefore, in this study, each feature was given a value based on a formula introduced by Simeon and Hilderman [25], named Categorical Proportional Difference (CPD). O'Keefe [26] adjusted the formula to fit the two-class case, as shown in Equation 1. ...
Article
Full-text available
The number of messages that can be mined from online entries increases as the number of online application users increases. In Malaysia, online messages are written in mixed languages known as ‘Bahasa Rojak’. Therefore, mining opinions using natural language processing activities is difficult. This study introduces a Malay Mixed Text Normalization Approach (MyTNA) and a feature selection technique based on an Immune Network System (FS-INS) for the opinion mining process using a machine learning approach. The purpose of MyTNA is to normalize noisy texts in online messages. In addition, FS-INS automatically selects relevant features for the opinion mining process. Several experiments involving 1000 positive movie reviews and 1000 negative movie reviews have been conducted. The results show that the accuracy of opinion mining using Naïve Bayes (NB), k-Nearest Neighbor (kNN) and Sequential Minimal Optimization (SMO) increases after the introduction of MyTNA and FS-INS.
Article
Full-text available
When unauthorized copying or stealing of the intellectual property of others happens, it is called plagiarism. Two main approaches are used to counter this problem: external plagiarism detection and intrinsic plagiarism detection. External algorithms compare a suspicious file with numerous sources, whereas intrinsic algorithms are allowed to inspect only the suspicious file in order to predict plagiarism. In this work, the area chosen for detecting plagiarism is programs, or source code files. The stealing that happens in the case of source code is copying the entire source code or the logic used in a particular program without permission or copyright. There exist many ways to detect plagiarism in source code files. Performing plagiarism checking for a large dataset has a very high computational cost and is a time-consuming job. To achieve computationally efficient similarity detection in source code files, the Hadoop framework is used, where parallel computation is possible for large datasets. But the raw data available to us is not in a suitable form for the existing plagiarism checking tools to work with, as its size is too large and it possesses the features of big data. Thus a qualifying model is required for the dataset to be fed into Hadoop so that it can be efficiently processed to check for plagiarism in source code. To generate such a model, machine learning is used, which incorporates big data with machine learning.
Preprint
Sentiment analysis is a domain of study that focuses on identifying and classifying the ideas expressed in text into positive, negative and neutral polarities. Feature selection is a crucial process in machine learning. In this paper, we aim to study the performance of different feature selection techniques for sentiment analysis. Term Frequency-Inverse Document Frequency (TF-IDF) is used as the feature extraction technique for creating the feature vocabulary. Various Feature Selection (FS) techniques are experimented with to select the best set of features from the feature vocabulary. The selected features are trained using different machine learning classifiers: Logistic Regression (LR), Support Vector Machines (SVM), Decision Tree (DT) and Naive Bayes (NB). The ensemble techniques Bagging and Random Subspace are applied to the classifiers to enhance performance on sentiment analysis. We show that the best FS techniques, when trained using ensemble methods, achieve remarkable results on sentiment analysis. We also compare the performance of FS methods trained using Bagging and Random Subspace with varied neural network architectures. We show that FS techniques trained using ensemble classifiers outperform neural networks while requiring significantly less training time and fewer parameters, thereby eliminating the need for extensive hyper-parameter tuning.
Preprint
Document clustering is a text mining technique used to provide better document search and browsing in digital libraries or online corpora. A lot of research has been done on biomedical document clustering based on existing ontologies. However, associations and co-occurrences of medical concepts are not well represented by using an ontology. In this research, a vector representation of disease concepts and a similarity measurement between concepts are proposed. They identify the closest disease concepts in the context of a corpus. Each document is represented using the vector space model. A weighting scheme is proposed to consider both local content and associations between concepts. A Self-Organizing Map (SOM) is used as the document clustering algorithm. The vector projection and visualization features of the SOM enable visualization and analysis of the cluster distributions and relationships in two-dimensional space. The experimental results show that the proposed document clustering framework generates meaningful clusters and facilitates visualization of the clusters based on the disease concepts.
Article
Full-text available
Sambat Online is a facility for collecting suggestions, criticism, complaints, and questions from the citizens of Malang about the Malang City Government, submitted through a dedicated website or by short message to a designated number. Each incoming complaint text is categorized into the area of the SKPD (regional work unit) responsible for it. To make complaint texts easier to organize and to improve the efficiency of administrators in sorting complaints and determining the target SKPD area, an intelligent system is needed that classifies documents according to their destination. K-Nearest Neighbor (K-NN) is a classification method that searches for the documents closest to a given document. The feature selection method used is Categorical Proportional Difference (CPD), which measures the degree of contribution of a word. The process consists of collecting training and test documents, preprocessing and feature selection, term weighting, classification, and finally testing and analysing the system's classification results in terms of accuracy, precision, recall, and F-measure. The most optimal performance was obtained with k = 1 and 100% of the features, yielding an accuracy of 91.84%; this is better than the accuracy obtained with feature selection, in which terms with low CPD values are removed.
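A minimal sketch of how the CPD score used in this study can be computed, assuming the common formulation CPD(w, c) = (A - B) / (A + B), where A is the number of documents of category c containing word w and B is the number of documents of other categories containing w. The toy complaint tokens and category labels below are invented; this is not the authors' implementation.

```python
# Illustrative CPD scoring over a tokenized corpus (a sketch, not the paper's code).
from collections import defaultdict

def cpd_scores(docs, labels):
    """Return {word: max CPD over categories}, using document counts per category."""
    in_cat = defaultdict(lambda: defaultdict(int))  # word -> category -> doc count
    total = defaultdict(int)                        # word -> doc count over all categories
    for tokens, cat in zip(docs, labels):
        for w in set(tokens):                       # document frequency, not term frequency
            in_cat[w][cat] += 1
            total[w] += 1
    scores = {}
    for w, per_cat in in_cat.items():
        best = -1.0
        for cat, a in per_cat.items():
            b = total[w] - a                        # occurrences outside the category
            best = max(best, (a - b) / (a + b))
        scores[w] = best
    return scores

docs = [["service", "slow", "complaint"], ["road", "damaged", "complaint"],
        ["service", "fast", "praise"]]
labels = ["public_service", "infrastructure", "public_service"]
print(sorted(cpd_scores(docs, labels).items(), key=lambda kv: -kv[1]))
```

Terms whose best CPD score falls below a chosen threshold could then be dropped before the K-NN classification step described above.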
Article
The development of online virtual communities has raised the importance of analyzing the massive volumes of text found on websites and social networks. This research analyzed financial blogs and online news articles to develop a public mood dynamic prediction model for stock markets, drawing on the perspectives of behavioral finance and the characteristics of online financial communities. The research applies big data and opinion mining approaches to investor sentiment analysis in Taiwan. The proposed model was verified using experimental datasets from ChinaTimes.com, cnYES.com, Yahoo stock market news, and Google stock market news over an 18 month period. Empirical results indicate that big data analysis techniques that assess the emotional content of commentary on current stock and financial issues can effectively forecast stock price movements.
Chapter
Sentiment analysis research has been increasing tremendously in recent times due to its wide range of business and social applications. Sentiment analysis from unstructured natural language text has recently received considerable attention from the research community. In this chapter, we propose a novel sentiment analysis model based on commonsense knowledge extracted from a ConceptNet-based ontology and on context information. The ConceptNet-based ontology is used to determine domain-specific concepts, which in turn yield the important domain-specific features. The polarities of the extracted concepts are then determined using a contextual polarity lexicon which we developed by considering the context information of each word. Finally, the semantic orientations of the domain-specific features of a review document are aggregated based on the importance of each feature with respect to the domain, where the importance of a feature is determined by its depth in the ontology. Experimental results show the effectiveness of the proposed methods.
Chapter
The field of sentiment analysis is an exciting new research direction due to the large number of real-world applications in which discovering people's opinions is important for better decision-making. The development of techniques for document-level sentiment analysis is one of the significant components of this area. Recently, people have started expressing their opinions on the Web, which has increased the need to analyze opinionated online content for various real-world applications. A lot of research on detecting sentiment from text is present in the literature, yet there remains considerable scope for improving existing sentiment analysis models, in particular by incorporating more semantic and commonsense knowledge.
Chapter
Sentiment analysis from unstructured natural language text has recently received considerable attention from the research community. In the context of biologically inspired machine learning approaches, finding good feature sets is particularly challenging yet very important. In this chapter, we focus on this fundamental issue of the sentiment analysis task. Specifically, we employ concepts as features and present a concept extraction algorithm that extracts semantic features exploiting the semantic relationships between words in natural language text. Additional conceptual information for a concept is obtained using the ConceptNet ontology: concepts extracted from text are sent as queries to ConceptNet to retrieve their semantics. Further, we select important concepts and eliminate redundant ones using the Minimum Redundancy and Maximum Relevance feature selection technique. All selected concepts are then used to build a machine learning model that classifies a given document as positive or negative.
Chapter
Two types of techniques have been used in the literature for the semantic orientation-based approach to sentiment analysis: (i) corpus based and (ii) dictionary, lexicon, or knowledge based. In this chapter, we explore the corpus-based semantic orientation approach. The corpus-based approach requires a large dataset to detect the polarity of terms and hence the sentiment of the text; its main limitation is that polarity can only be computed for terms that appear in the training corpus. The approach has been explored extensively in the literature owing to its simplicity [29, 120]. It first mines sentiment-bearing terms from unstructured text and then computes the polarity of those terms. Most sentiment-bearing terms are multi-word features rather than bag-of-words features, e.g., "good movie," "nice cinematography," "nice actors," etc. The performance of the semantic orientation-based approach has been limited in the literature due to inadequate coverage of such multi-word features.
Chapter
Sentiment analysis research has attracted a large number of researchers around the globe [61, 66, 93, 127]. Sentiment analysis attempts to determine whether a given text is subjective or objective and, further, whether a subjective text contains positive or negative opinion. The techniques employed by sentiment analysis models can be broadly categorized into machine learning [93] and semantic orientation approaches [29, 132]. A lot of research has been done on detecting sentiment from text [67], yet there remains considerable scope for improving existing sentiment analysis models; the performance of existing methods can be further improved by including more semantic information.
Chapter
Opinion Mining, or Sentiment Analysis, is the study of people's opinions or sentiments expressed in text towards entities such as products and services. It has always been important to know what other people think, and with the rapid growth in availability and popularity of online review sites, blogs, forums, and social networking sites, the necessity of analysing and understanding these reviews has arisen. The main approaches to sentiment analysis can be categorized into semantic orientation-based, knowledge-based, and machine learning approaches. This chapter surveys the machine learning approaches applied to sentiment analysis applications, with the main emphasis on research that applies machine learning methods to sentiment classification at the document level. Machine learning-based approaches work in the following phases, which are discussed in detail in this chapter: (1) feature extraction, (2) feature weighting schemes, (3) feature selection, and (4) machine learning methods. The chapter also discusses standard free benchmark datasets and evaluation methods for sentiment analysis, and concludes with a comparative study of some state-of-the-art methods for sentiment analysis and possible future research directions in opinion mining and sentiment analysis.
Article
Full-text available
Filtering of spam emails is a significant operation in email systems. The efficiency of this process is determined by many factors, such as the number of features, the representation of samples, and the classifier. This study covers all these factors and aims to find the optimal settings for email spam filtering. Twelve feature selection methods extensively used in text categorization are implemented to synthesize prominent attributes from different parts of the mail (i.e., header, subject, and body). Optimal classification performance is obtained with the Weighted Mutual Information and Log-TFIDF-Cosine (LTC) feature selection methods for the header and body features of the mail, using Random Forest and Support Vector Machine classifiers respectively. An overall F1-measure of 0.978 with a prediction time of 0.44 s is achieved when 20% of the original feature length is considered.
Conference Paper
To provide a solution for the detection of metamorphic viruses (obfuscated malware), we propose a non-signature-based approach using feature selection techniques such as Categorical Proportional Difference (CPD), Weight of Evidence of Text (WET), Term Frequency-Inverse Document Frequency (TF-IDF), and Term Frequency-Inverse Document Frequency-Class Frequency (TF-IDF-CF). The feature selection methods are employed to rank and prune bi-gram features obtained from malware and benign files, and the synthesized features are further evaluated for their prominence in either class. Using our proposed methodology, 100% accuracy is obtained on the test samples. Hence, we argue that the proposed statistical scanner can identify future metamorphic variants and can assist antivirus software with high accuracy.
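As a loose illustration of representing programs as ranked bi-gram features, the sketch below builds opcode bi-grams with scikit-learn and ranks them with a chi-square score as a stand-in scorer; the paper itself ranks features with CPD, WET, TF-IDF, and TF-IDF-CF. The opcode traces and labels shown are invented.

```python
# Sketch: bag-of-bi-grams over opcode mnemonics, ranked by a stand-in score.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

traces = ["mov push call pop ret", "push push call add ret",
          "mov add sub mov ret", "mov mov add ret"]
labels = [1, 1, 0, 0]                       # 1 = malware, 0 = benign (toy labels)

vec = CountVectorizer(ngram_range=(2, 2))   # bi-grams of opcode mnemonics
X = vec.fit_transform(traces)
scores, _ = chi2(X, labels)                 # stand-in for CPD / WET / TF-IDF-CF ranking

ranked = sorted(zip(vec.get_feature_names_out(), scores), key=lambda t: -t[1])
print(ranked[:5])                           # top-ranked bi-gram features
```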
Article
Text classification is an important part of information retrieval and text mining. The high dimensionality of the feature space weakens the ability of feature items to distinguish between categories. In this paper, we introduce an improved CPD feature selection method to reduce the dimension of the feature space. CPD does not consider the importance of items within a document or the relevance between items. We define frequency, dispersion, concentration, and feature redundancy, and use a minimum-frequency threshold and the mutual information between items to improve the ability of items to distinguish categories, to remove redundant feature items, and to reduce computational complexity, ultimately increasing the precision and recall of classification. A Bayes classifier is used for text classification, and the F value is used as the evaluation index. The experimental results show that the improved CPD is superior to CPD and to other feature selection methods.
Article
The rapid growth of online social media acts as a medium through which people contribute their opinions and emotions as text messages, including reviews and opinions on topics such as movies, books, products, and politics. Opinion mining refers to the application of natural language processing, computational linguistics, and text mining to identify or classify whether the opinion expressed in a text message is positive or negative. Back-propagation neural networks are supervised machine learning methods that analyze data and recognize patterns for classification. This work focuses on binary classification of text sentiment into positive and negative reviews. In this study, Principal Component Analysis (PCA) is used to extract the principal components to be used as predictors, and a back-propagation neural network (BPN) is employed as the classifier. The performance of PCA + BPN and of BPN without PCA is compared using Receiver Operating Characteristic (ROC) analysis, and the classifier is validated using 10-fold cross-validation. The results show the effectiveness of BPN with PCA as a feature reduction method for text sentiment classification.
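A minimal sketch of the PCA plus back-propagation idea, assuming a scikit-learn pipeline in which TF-IDF vectors are densified, projected onto principal components, and classified by a multilayer perceptron trained with back-propagation. The reviews, labels, and hyper-parameters are invented; this is not the paper's implementation.

```python
# Sketch: TF-IDF -> PCA predictors -> back-propagation network (MLP) classifier.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

reviews = ["excellent product, works perfectly", "broke after one day, terrible",
           "very happy with this purchase", "awful quality, do not buy"]
labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative review

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("dense", FunctionTransformer(lambda X: X.toarray())),  # PCA needs a dense matrix
    ("pca", PCA(n_components=2)),                           # principal-component predictors
    ("bpn", MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)),
])
model.fit(reviews, labels)
print(model.predict(["terrible quality", "works perfectly, very happy"]))
```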
Article
It is a well-known fact that neuropsychiatric disorders cause abnormalities in the connectivity patterns of brain regions. Identifying and characterising these abnormalities can be exploited to improve the diagnosis of neuropsychiatric diseases with the help of resting-state functional magnetic resonance imaging (rfMRI) data. However, this is not an easy task because rfMRI produces very high-dimensional data, which leads to the curse of dimensionality. It is therefore necessary to reduce the number of features in order to obtain better classification accuracy, which requires a robust feature selection criterion that best describes the differences between epileptic patients and the healthy control group. In this paper we present a classification model in which we introduce a voting-based feature selection (VFS) approach that ensures the selection of the most discriminative features by combining the capabilities of several feature selection techniques. We use AdaBoost with an RBF network as the classifier to avoid overfitting, and apply this model to rfMRI-based data to discriminate between the two groups. We correctly classify epileptic patients versus healthy controls with 85.33% classification accuracy on a heterogeneous dataset using the proposed model. To the best of our knowledge, the results presented in this paper are better than other results reported in the current literature on this dataset, confirming the effectiveness of our classification model.
Article
Our research developed a non-signature-based approach employing feature selection methods such as Categorical Proportional Difference (CPD), Weight of Evidence of Text (WET), Term Frequency-Inverse Document Frequency (TF-IDF), Term Frequency-Inverse Document Frequency-Class Frequency (TF-IDF-CF), the Galavotti-Sebastiani-Simi coefficient (GSS), and Term Significance (TS). A classification model is developed by considering bi-gram features ranked with these feature selection techniques. The proposed feature selection approaches detect unseen malware samples with accuracy in the range of 99% to 100%. The relevance of the feature ranking methods at varying feature lengths is ascertained using McNemar's test.
Article
Full-text available
Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g., Information Gain) evaluated on a benchmark of 229 text classification problem instances that were gathered from Reuters, TREC, OHSUMED, etc. The results are analyzed from multiple goal perspectives (accuracy, F-measure, precision, and recall), since each is appropriate in different situations. The results reveal that a new feature selection metric we call 'Bi-Normal Separation' (BNS) outperformed the others by a substantial margin in most situations. This margin widened in tasks with high class skew, which is rampant in text classification problems and is particularly challenging for induction algorithms. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner faced with a single dataset who seeks to choose one (or a pair of) metrics that are most likely to yield the best performance. From this perspective, BNS was the top single choice for all goals except precision, for which Information Gain yielded the best result most often. This analysis also revealed, for example, that Information Gain and Chi-Squared have correlated failures, and so they work poorly together. When choosing optimal pairs of metrics for each of the four performance goals, BNS is consistently a member of the pair; e.g., for greatest recall, the pair BNS + F1-measure yielded the best performance on the greatest number of tasks by a considerable margin.
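For reference, the Bi-Normal Separation score described here is usually computed as BNS(w) = |F^-1(tpr) - F^-1(fpr)|, where F^-1 is the inverse of the standard normal cumulative distribution, tpr = P(word | positive class), and fpr = P(word | negative class). The sketch below is a rough rendering of that formula; the clipping constant (shown as 0.0005 to keep the inverse CDF finite) and the document counts are illustrative rather than taken from the paper.

```python
# Sketch of the Bi-Normal Separation (BNS) feature-scoring metric.
from scipy.stats import norm

def bns(tp, fp, pos, neg, eps=0.0005):
    """tp/fp: positive/negative documents containing the word; pos/neg: class sizes."""
    tpr = min(max(tp / pos, eps), 1 - eps)   # clip away from 0 and 1
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))

# A word in 80 of 100 positive documents but only 5 of 100 negative documents
# scores much higher than one spread evenly across both classes.
print(bns(tp=80, fp=5, pos=100, neg=100))    # strongly discriminative
print(bns(tp=50, fp=48, pos=100, neg=100))   # barely discriminative
```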
Article
Full-text available
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
Conference Paper
Full-text available
We tackle two different problems of text categorization (TC), namely feature selection and classifier induction. Feature selection (FS) refers to the activity of selecting, from the set of r distinct features (i.e., words) occurring in the collection, the subset of r′ ≪ r features that are most useful for compactly representing the meaning of the documents. We propose a novel FS technique based on a simplified variant of the χ2 statistic. Classifier induction refers instead to the problem of automatically building a text classifier by learning from a set of documents pre-classified under the categories of interest. We propose a novel variant, based on the exploitation of negative evidence, of the well-known k-NN method. We report the results of systematic experimentation with these two methods performed on the standard REUTERS-21578 benchmark.
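The simplified chi-square variant referred to here is commonly written as s-chi2(t, c) = P(t, c) * P(not t, not c) - P(t, not c) * P(not t, c). The short sketch below computes it from document counts under that assumption; the counts are invented and purely illustrative.

```python
# Sketch of the simplified chi-square (GSS-style) term score from document counts.
def simplified_chi2(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    """n_tc: docs in c containing t; n_t_notc: docs not in c containing t; etc."""
    n = n_tc + n_t_notc + n_nott_c + n_nott_notc
    p_tc, p_t_notc = n_tc / n, n_t_notc / n
    p_nott_c, p_nott_notc = n_nott_c / n, n_nott_notc / n
    return p_tc * p_nott_notc - p_t_notc * p_nott_c

# A term concentrated in category c scores well above one spread evenly.
print(simplified_chi2(40, 5, 10, 45))   # concentrated in c
print(simplified_chi2(25, 25, 25, 25))  # independent of c -> 0.0
```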
Article
Full-text available
While naive Bayes is quite effective in various data mining tasks, it shows disappointing results in the automatic text classification problem. Based on an examination of naive Bayes applied to natural language text, we found a serious problem in the parameter estimation process which causes poor results in the text classification domain. In this paper, we propose two empirical heuristics: per-document text normalization and a feature weighting method. While these are somewhat ad hoc methods, our proposed naive Bayes text classifier performs very well on the standard benchmark collections, competing with state-of-the-art text classifiers based on highly complex learning methods such as SVM.
Article
Categorization of documents is challenging, as the number of discriminating words can be very large. We present a nearest neighbor classification scheme for text categorization in which the importance of discriminating words is learned using mutual information and weight adjustment techniques. The nearest neighbors for a particular document are then computed based on the matching words and their weights. We evaluate our scheme on both synthetic and real world documents. Our experiments with synthetic data sets show that this scheme is robust under different emulated conditions. Empirical results on real world documents demonstrate that this scheme outperforms state of the art classification algorithms such as C4.5, RIPPER, Rainbow, and PEBLS.
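A rough sketch (not the authors' system) of the weighted-matching idea this abstract describes: each word receives a mutual-information weight with respect to the class label, and the similarity between a query document and a training document is the summed weight of their shared words. The news snippets and labels below are invented.

```python
# Sketch: mutual-information word weights + weighted word overlap for 1-NN.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

train = ["stocks fell sharply today", "team wins championship final",
         "markets rallied after strong earnings", "coach praised star players"]
y = ["finance", "sport", "finance", "sport"]

vec = CountVectorizer(binary=True)
X = vec.fit_transform(train).toarray()
weights = mutual_info_classif(X, y, discrete_features=True, random_state=0)

def predict(doc):
    q = vec.transform([doc]).toarray()[0]
    sims = (X * q) @ weights            # summed weight of words shared with each neighbour
    return y[int(np.argmax(sims))]      # label of the nearest (most similar) document

print(predict("earnings report lifted markets"))  # expected: finance
```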
Article
We propose a set of machine learning (ML)-based scoring measures for conducting feature selection. We have tested these measures on documents from two well-known corpora, comparing them with other measures previously applied for this purpose. In particular, we have analyzed which measure obtains the best overall classification performance in terms of properties such as precision and recall, emphasizing to what extent some statistical properties of the corpus affect performance. The results show that some of our measures outperform the traditional measures in certain situations.
Article
In this paper, we describe an automated learning approach to text categorization based on perceptron learning and a new feature selection metric called the correlation coefficient. Our approach has been tested on the standard Reuters text categorization collection. Empirical results indicate that our approach outperforms the best published results on this Reuters collection; in particular, our new feature selection method yields considerable improvement. We also investigate the usability of our automated learning approach by actually developing a system that categorizes texts into a tree of categories. We compare the accuracy of our learning approach to a rule-based, expert system approach that uses a text categorization shell built by the Carnegie Group. Although our automated learning approach still gives lower accuracy, by appropriately incorporating a set of manually chosen words to use as features, the combined, semi-automated approach yields accuracy close to the rule-based approach. ...
Article
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ2-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98% of the unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed, we found strong correlations between the DF, IG, and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest computational cost, can be reliably used instead of IG or CHI when the computation of those measures is too expensive. TS compares favorably with the other methods with up to 5...
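For reference, the chi-square (CHI) term-goodness statistic compared in this study is conventionally computed from a two-by-two term/category contingency table. The sketch below follows that standard formulation; the document counts are invented.

```python
# Sketch of the chi-square term-goodness statistic for a term t and category c:
# chi2(t, c) = N * (A*D - C*B)^2 / ((A + C) * (B + D) * (A + B) * (C + D)),
# where A = docs in c with t, B = docs not in c with t,
#       C = docs in c without t, D = docs not in c without t.
def chi_square(a, b, c, d):
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

print(chi_square(a=40, b=10, c=20, d=130))  # term concentrated in the category
print(chi_square(a=20, b=20, c=80, d=80))   # term independent of the category -> 0.0
```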