Conference Paper

Categorical Proportional Difference: A Feature Selection Method for Text Categorization.

Authors: Simeon and Hilderman

Abstract

Supervised text categorization is a machine learning task where a predefined category label is automatically assigned to a previously unlabelled document based upon characteristics of the words contained in the document. Since the number of unique words in a learning task (i.e., the number of features) can be very large, the efficiency and accuracy of the learning task can be increased by using feature selection methods to extract from a document a subset of the features that are considered most relevant. In this paper, we introduce a new feature selection method called categorical proportional difference (CPD), a measure of the degree to which a word contributes to differentiating a particular category from other categories. The CPD for a word in a particular category in a text corpus is a ratio that considers the number of documents of a category in which the word occurs and the number of documents from other categories in which the word also occurs. We conducted a series of experiments to evaluate CPD when used in conjunction with SVM and Naive Bayes text classifiers on the OHSUMED, 20 Newsgroups, and Reuters-21578 text corpora. Recall, precision, and the F-measure were used as the measures of performance. The results obtained using CPD were compared to those obtained using six common feature selection methods found in the literature: χ2, information gain, document frequency, mutual information, odds ratio, and simplified χ2. Empirical results showed that, in general, according to the F-measure, CPD outperforms the other feature selection methods in four out of six text categorization tasks.
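As a rough illustration of the document-frequency ratio described in the abstract, the following Python sketch scores each (word, category) pair. The function and variable names, and the exact form of the ratio used here ((A − B)/(A + B)), are illustrative assumptions and may differ in detail from the paper's definition.

```python
from collections import defaultdict

def cpd_scores(docs, labels):
    """CPD-style score per (word, category) from document frequencies.

    docs   : list of token lists (one per document)
    labels : list of category labels, aligned with docs
    Score  : (A - B) / (A + B), where A is the number of documents of the
             category containing the word and B is the number of documents
             of all other categories containing the word.
    """
    df = defaultdict(lambda: defaultdict(int))   # word -> category -> doc count
    total_df = defaultdict(int)                  # word -> doc count over all categories
    for tokens, label in zip(docs, labels):
        for word in set(tokens):                 # count each word once per document
            df[word][label] += 1
            total_df[word] += 1

    scores = {}
    for word, per_cat in df.items():
        for cat, a in per_cat.items():
            b = total_df[word] - a               # occurrences in other categories
            scores[(word, cat)] = (a - b) / (a + b)
    return scores

# toy example: "ball" appears only in sports documents, so its score for sports is 1.0
docs = [["ball", "game"], ["ball", "team"], ["election", "vote"]]
labels = ["sports", "sports", "politics"]
print(cpd_scores(docs, labels)[("ball", "sports")])   # 1.0
```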


... We follow [MRS08, footnote on page 272] and name it Pointwise Mutual Information. The name Mutual Information is later also used for this method in [SH08] and [Seb02]. We have not tested this metric, as it is known for its poor results. ...
... Example: The e-mail example is ranked by weighted average BNS values; the rest of the values are calculated in the same way and are presented in [SH08]. ...
... Categorical Proportional Difference is reported in [SH08] to have excellent performance for both Naïve Bayes and Support Vector Machines, but at the cost of low aggressivity levels. Their study used an exhaustive search to find the percentage of features at which each feature selection method performed best. ...
... In the literature, different approaches have been taken to select features, and different algorithms have then been used to train models for sentiment analysis of tweets. However, Categorical Proportional Difference (CPD) and Chi-square based feature extraction for sentiment analysis are rarely found, although they are good candidates for supervised text classification [11,12]. The main objective of this study is to find the best SVM kernel function with an optimal feature selection strategy. ...
... i) Categorical Proportional Difference (CPD): This is a feature selection technique for text categorization. CPD measures the degree to which a word contributes to differentiating a particular category from other categories [11]. In this method, a CPD value is assigned to every n-gram, and the n-grams with the highest scores are selected, as shown in equation (1). ...
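To make the selection step described above concrete, here is a hypothetical scikit-learn sketch that ranks vocabulary terms by their best CPD score across classes and keeps a top fraction before training a classifier. The keep fraction, the toy data, and the choice of LinearSVC are illustrative assumptions, not the cited study's setup.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def select_top_cpd_terms(texts, labels, keep=0.2):
    """Rank vocabulary terms by their best CPD score over all classes and
    keep the top `keep` fraction (a tunable parameter)."""
    vec = CountVectorizer(binary=True)              # binary document occurrence
    X = vec.fit_transform(texts)
    y = np.asarray(labels)
    vocab = np.array(vec.get_feature_names_out())

    total_df = np.asarray(X.sum(axis=0)).ravel()    # docs containing each term
    best = np.full(len(vocab), -1.0)
    for c in np.unique(y):
        a = np.asarray(X[y == c].sum(axis=0)).ravel()   # doc frequency within class c
        b = total_df - a                                # doc frequency in other classes
        best = np.maximum(best, (a - b) / np.maximum(a + b, 1))
    top = np.argsort(best)[::-1][: max(1, int(keep * len(vocab)))]
    return set(vocab[top])

texts = ["great plot and acting", "terrible plot, boring acting",
         "great film", "boring and terrible film"]
labels = ["pos", "neg", "pos", "neg"]
terms = select_top_cpd_terms(texts, labels, keep=0.5)
clf_vec = CountVectorizer(vocabulary=sorted(terms))
clf = LinearSVC().fit(clf_vec.fit_transform(texts), labels)
print(clf.predict(clf_vec.transform(["great acting"])))   # expected: ['pos']
```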
Article
Full-text available
Social media has become incredibly popular these days for communicating with friends and for sharing opinions. According to current statistics, almost 2.22 billion people used social media in 2016, roughly one third of the world population and three times the entire population of Europe. In social media, people share their likes, dislikes, opinions, interests, etc., so it is possible to learn a person's thoughts about a specific topic from the data they share. Since Twitter is one of the most popular social media platforms in the world, it is a very good source for opinion mining and sentiment analysis on different topics. In this research, SVM with different kernel functions and AdaBoost are evaluated using CPD and Chi-square feature extraction techniques to find the best sentiment classification model. The reported average accuracies of AdaBoost for Chi-square and CPD are 70.2% and 66.9%, respectively. The SVM radial basis kernel and polynomial kernel with Chi-square n-grams reported average accuracies of 73.73% and 68.67%, respectively. Among the experiments performed, the SVM sigmoid kernel with Chi-square n-grams provided the maximum accuracy of 74.4%.
... Other approaches calculate a score for each individual feature and then select a predefined amount of features based on the rank of the scores, such as the Chi-square statistic (CHI), information gain (IG) and so on (Keshtkar and Inkpen 2009; O'Keefe and Koprinska 2009; Simeon and Hilderman 2008; Tan and Zhang 2008; Ye et al. 2009). From Table 1, we can see that these kinds of methods are effective in some experiments. ...
... CPD (Simeon and Hilderman 2008) is another easy term selection method for multi-class classification problems. O'Keefe and Koprinska (2009) employed CPD on binary sentiment classification. ...
Article
Full-text available
Text-based social media has become one of the most important communication tools between customers and enterprises. In social media, users can easily express their opinions and evaluations of products or services. These online user experiences, especially negative evaluations, indeed affect other consumers' behavior. Consequently, effectively identifying customers' sentiments and preventing negative comments from causing great damage to enterprises has become a critical issue. In recent years, machine learning algorithms have been viewed as effective solutions for sentiment classification. However, as the number of online reviews grows, the dimensionality of the text data rises remarkably, and the performance of machine learning methods degrades due to this dimensionality problem. Moreover, conventional feature selection methods tend to select attributes from the majority sentiment, which usually cannot improve classification performance. Therefore, this study presents two feature selection methods: a modified categorical proportional difference (MCPD) approach that improves the conventional CPD method, and a balance category feature (BCF) strategy that selects attributes equally from both positive and negative examples, to improve sentiment classification performance. Finally, several real-world cases of sentiment analysis on text reviews are provided to demonstrate the effectiveness of our proposed methods. The results show that the combination of the proposed BCF strategy and the MCPD method can not only remarkably reduce the feature space but also improve sentiment classification performance.
... where A is the frequency with which f and c occur together, B is the frequency with which f occurs without c, C is the frequency with which c occurs without f, D is the frequency with which neither c nor f occurs, and lastly N is the total number of documents. Simeon and Hilderman proposed the Categorical Proportional Difference (CPD) method to measure the usefulness of a term in differentiating between categories [9]. The measurement is based on the ratio of the frequencies of a word occurring across different categories of documents. ...
... When the CPD score approaches its maximum of 1, it shows that the feature occurs mostly in documents of a particular category only and is helpful in distinguishing between categories. CPD is quite a recent method and has been used in only a few experiments [9,10]. The benefit of CPD is that it can eliminate common terms that have high document frequency but are unimportant, such as stop words, based on their roughly equal occurrence across all classes of documents [11]. ...
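Using the contingency counts A, B, C and D defined in the excerpt above, a standard chi-square term-class score and the CPD ratio can be computed as in the sketch below. This is a generic illustration under those definitions, not code from the cited works.

```python
def chi_square(A, B, C, D):
    """Chi-square statistic for a term-class pair from contingency counts:
    A = docs containing the term and belonging to the class
    B = docs containing the term but not in the class
    C = docs in the class without the term
    D = docs with neither the term nor the class
    """
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom

def cpd(A, B):
    """CPD for a term-class pair: approaches 1 when the term occurs almost
    only in the class, and 0 when it occurs equally in all classes."""
    return (A - B) / (A + B)

# a term that appears in 40 of 50 documents of the class and in 10 of 150 other documents
print(round(chi_square(A=40, B=10, C=10, D=140), 2))   # ~107.56
print(cpd(A=40, B=10))    # 0.6, discriminative
print(cpd(A=75, B=75))    # 0.0, stop-word-like term occurring equally everywhere
```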
Chapter
Full-text available
Document sentiment analysis is the task of determining whether a document has a positive, negative or neutral sentiment. It is made up of subtasks including feature extraction, feature selection and sentiment classification. Feature selection is the task of selecting relevant features that can help the classifier produce better results. This paper focuses on comparing classification performance across several feature selection methods used to select relevant features and to reduce the document-term matrix representation of the documents. The purpose of applying feature selection, besides selecting relevant features, is also to reduce the number of features and preserve the efficiency of the whole system. In this work, the experiment is designed to investigate the effectiveness of several selected feature selection methods in improving sentiment analysis results. Based on the findings, although common feature selection methods such as Document Frequency (DF), Information Gain (IG) and Chi-Squared Statistics (CSS) are able to produce high sentiment analysis accuracy, the Categorical Probability Proportional Difference (CPPD) method is found to be more effective, as it produces higher accuracy in classifying the documents based on their sentiments. The Categorical Proportional Difference (CPD) method produces acceptable classification results but is weak in reducing the number of features. In short, the CPPD method enables the sentiment analysis task to be conducted with a higher accuracy rate coupled with a high feature reduction rate.
... In various studies, feature selection methods are also often used to reduce dimensionality and speed up the computation process. In addition, by using feature selection we can increase the efficiency and accuracy of extracting from a document a subset of the features considered most relevant (Simeon, 2008). The study conducted by Simeon compared several feature selection methods, one of which is Categorical Proportional Difference (CPD). ...
... Based on the above, this study uses the Naïve Bayes method because of its better accuracy, and feature selection is performed before classification using the Categorical Proportional Difference (CPD) method to measure the degree to which a word contributes to deciding whether it deserves to be prioritized for classification. CPD is used because it can find words that occur frequently in a document class, using the positive document frequency and the negative document frequency (Simeon, 2008). This research is expected to address the problem of analyzing and evaluating users' views of a product, so that product weaknesses can be identified from the users' point of view and the usability and sales of the product can be improved. ...
Article
Full-text available
Beauty products have become popular in various circles, especially among women. Most women own beauty products, and they are considered a primary requirement for improving their appearance. A product cannot be separated from the comments or reviews consumers write about it. These reviews help consumers be more selective in choosing a product, and they help producers measure the quality of the products they make. However, producers sometimes have difficulty sorting and categorizing reviews, for example whether a product is of good, fairly good, or poor quality. In this study, the assessment of a product based on a given review is its rating, so a rating prediction system is needed to predict and determine the appropriate rating based on the reviews users give a product. To support the system, this study uses the Naïve Bayes and Categorical Proportional Difference methods: Naïve Bayes is the classification method, while Categorical Proportional Difference is the feature selection method used to further optimize the classification results. From the test results, the best accuracy of 87% was obtained when 50% of the features were used, which was better than the results with the other feature usage ratios of 25%, 75% and 100%. These results show that CPD can select the words that are considered relevant or irrelevant for classification.
Chapter
Full-text available
Sentiment analysis is the task of classifying documents according to their sentiment polarity. Before classifying sentiment documents, plain text documents need to be transformed into workable data for the system. This step is known as feature extraction. Feature extraction produces text representations that are enriched with information in order to achieve better classification results. The experiment in this work aims to investigate the effects of applying different sets of extracted features and to discuss the behavior of these features in sentiment analysis. The feature extraction methods include unigrams, bigrams, trigrams, Part-Of-Speech (POS) and SentiWordNet methods. The unigram, part-of-speech and SentiWordNet features are word-based features, whereas bigrams and trigrams are phrase-based features. From the results obtained, phrase-based features are more effective for sentiment analysis, as the accuracies produced are much higher than those of word-based features. This might be because word-based features disregard the sentence structure and word order of the original text, thus distorting its original meaning. Bigram and trigram features retain some of the sentence order, thus contributing to better representations of the text.
... In this paper, we aim to construct a more accurate common subspace and then train a classifier for the target domain based on the common subspace. Given a labeled source domain S and an unlabeled target domain T, we can compute the SO (sentiment orientation) of features in the source domain using methods such as CPD (Categorical Proportional Difference) [9] or OR (odds ratio) [10]. Then we predict the sentiment orientation of these features in the target domain based on co-occurrence relationships, and the common subspace is constructed according to the consistency of sentiment orientation between the domains. ...
... In this subsection we select the domain-independent features used to construct the common subspace according to their sentiment orientation in both domains. Normally, we can use the difference between a feature's frequency in positive text and in negative text to represent its sentiment orientation, as in CPD (Categorical Proportional Difference) [9] and OR (odds ratio) [10]. In this paper we use CPD to calculate the sentiment orientation of features; its formula is shown in Eq. (1). ...
... However, in many cases, the performance of the classification task is still poor. Therefore, in this study, a GA-based metaheuristic optimization algorithm is used for effective feature selection, following the advanced binary metaheuristic models proposed in [13][14][15][16]. ...
Article
Full-text available
Multilabel classification of Arabic text is an important task for understanding and analyzing social media content. It can enable the categorization and monitoring of social media posts, the detection of important events, the identification of trending topics, and the gaining of insights into public opinion and sentiment. However, multilabel classification of Arabic contents can present a certain challenge due to the high dimensionality of the representation and the unique characteristics of the Arabic language. In this paper, an effective approach is proposed for Arabic multilabel classification using a metaheuristic Genetic Algorithm (GA) and ensemble learning. The approach explores the effect of Arabic text representation on classification performance using both Bag of Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF) methods. Moreover, it compares the performance of ensemble learning methods such as the Extra Trees Classifier (ETC) and Random Forest Classifier (RFC) against a Logistic Regression Classifier (LRC) as a single and ensemble classifier. We evaluate the approach on a new public dataset, namely, the MAWQIF dataset. The MAWQIF is the first multilabel Arabic dataset for target-specific stance detection. The experimental results demonstrate that the proposed approach outperforms the related work on the same dataset, achieving 80.88% for sentiment classification and 68.76% for multilabel tasks in terms of the F1-score metric. In addition, the data augmentation with feature selection improves the F1-score result of the ETC from 65.62% to 68.80%. The study shows the ability of the GA-based feature selection with ensemble learning to improve the classification of multilabel Arabic text.
... The required and restricted data analysis tools analyze various financial products and determine the recommended results. The literature introduces the reform of the financial system and the gradual completion of the regulatory system, which put forward new and higher requirements for the management and development of the banking industry [19]. In order to meet customers' needs in the telecommunications environment, more and more financial products have been launched, and banks have paid more attention to customer service and marketing while improving their management level [20]. ...
Preprint
Full-text available
Collaborative filtering recommendation is a technology that has appeared rapidly in information filtering and information systems in recent years. At present, it is widely used in commercial activities and has achieved very satisfactory results. The research in this article is based on the basic operating system method and the suggestion of the automatic recognition system (collaborative filtering recommendation), that is, customers purchasing fixed deposits. Based on behavioral theory and new institutional arrangements, this article explores the influence and effect of the external development of digital banks on the digital behavior of traditional commercial banks, and concludes that the development of digital banks has a positive impact on bank operations and product differentiation innovation. The economic pressure brought about by the development of digital banks first promoted banks' product innovation, while the social pressure mechanism affected banks' digital innovation in management and production. Social pressure has an impact on bank management and digital innovation. The improvement of financial business transparency and the diversification of financial products have also increased competition in the financial market. Accurately predicting customer preferences is crucial for financial business companies. The development of an effective classification model will not only help increase company profits, but also effectively reduce costs. In the user-based collaborative filtering algorithm, by establishing a time-series-based consumer network, we determine the targeted influence relationships between users to find the neighbor set more accurately, and establish a time-series-based collaborative filtering algorithm to improve the accuracy of the recommendation algorithm.
... CPD measures the degree to which a term contributes to differentiating between the categories in the corpus [47]. Allotey [48] uses the CPD value for sentiment analysis and classification of online reviews. ...
Article
Full-text available
Authorship attribution has been largely investigated based on writing style analysis to identify the author of a given document. This paper describes a supervised approach to identify the lyricist for a given Tamil film lyric document for the first time. In addition to statistical features, linguistic, poetic and semantic features have been used to identify the lyricists. The accuracy of the system was improved by incorporating different classification models with different feature selection methods. The evaluation was carried out using 15,286 lyric documents for a set of 113 lyricists and the performance of the system was determined based on precision, recall and F-measure. The experimental results suggest that the support vector machine (SVM) method can be used to achieve better accuracy compared to the other methods investigated.
... ▪ Categorical Proportional Difference (CPD): This method was introduced by the authors in (Simeon & Hilderman, 2008) to quantify the influence of each feature in indicating a particular class. The frequency of each feature is calculated separately. ...
Article
Full-text available
With the pervasive growth of web-based businesses, sentiment analysis of online reviews has attracted increasing interest among text mining experts. The problem is complicated when these reviews are in the Persian language, since all existing works are focused on the English language, leaving other languages to multilingual models with limited resources. Due to these drawbacks, we aim to give insight into the different stages of Persian Sentiment Analysis. This study presents a taxonomy of all Persian Sentiment Analysis works, considering the most common techniques. Four steps are considered, namely pre-processing, feature engineering, lexicon generation, and classification. As a result, we reveal that newer works focus on deep learning methods. We also suggest that applying other methods, such as heuristic and hybrid approaches, would be worthwhile for classification performance in Persian Sentiment Analysis. Finally, we summarize the most important issues in this domain, including the lack of datasets, lexicons, tools, etc.
... We investigate whether there is a relation between feature selection algorithms commonly used for text classification and the most attended words in the fine-tuned language models. We center our analysis on four feature selection methods used for text classification [14,18,22]: Chi-square (chi), Information Gain (ig), Document Frequency (df), and Categorical Proportional Difference (pd). Chi-square measures the lack of independence between a word and a class; its value is zero if the word and the class are independent. ...
Chapter
Full-text available
We investigate the self-attention mechanism of BERT in a fine-tuning scenario for the classification of scientific articles over a taxonomy of research disciplines. We observe how self-attention focuses on words that are highly related to the domain of the article. Particularly, a small subset of vocabulary words tends to receive most of the attention. We compare and evaluate the subset of the most attended words with feature selection methods normally used for text classification in order to characterize self-attention as a possible feature selection approach. Using ConceptNet as ground truth, we also find that attended words are more related to the research fields of the articles. However, conventional feature selection methods are still a better option to learn classifiers from scratch. This result suggests that, while self-attention identifies domain-relevant terms, the discriminatory information in BERT is encoded in the contextualized outputs and the classification layer. It also raises the question whether injecting feature selection methods in the self-attention mechanism could further optimize single sequence classification using transformers.
... The CPPD includes two separate approaches: the Categorical Proportional Difference (CPD) and the Probability Proportional Difference (PPD) techniques. The CPD approach estimates the degree to which a word contributes to discriminating the class, and only the words with the highest contribution are included in the classification [17]. The PPD methodology measures the degree to which a word belongs to a specified class, and the variance is a measure of the ability to differentiate between them. ...
Article
Full-text available
Undoubtedly, huge volumes of business data can make data analysis more complicated, such that the decision-making process becomes out of reach. This condition happens in the field of consumer buying behavior. A well-known method called sentiment analysis can help in extracting information about current trends and can increase the market value of a product by improving its quality. One of the approaches to sentiment analysis is the feature selection technique. However, this technique has a combinatorial nature, and the analysis of huge data can involve uncertain parameters. This paper describes a framework for solving sentiment analysis based on a feature selection approach using stochastic combinatorial programming.
... Following the seminal STUCCO algorithm, Bay proposed an initial data discretization method for CSM [1]. Simeon and Hilderman proposed a slightly modified equal width binning interval to discretize continuous variables [9]. ...
Preprint
Facebook operates a family of services used by over two billion people daily on a huge variety of mobile devices. Many devices are configured to upload crash reports should the app crash for any reason. Engineers monitor and triage millions of crash reports logged each day to check for bugs, regressions, and any other quality problems. Debugging groups of crashes is a manually intensive process that requires deep domain expertise and close inspection of traces and code, often under time constraints. We use contrast set mining, a form of discriminative pattern mining, to learn what distinguishes one group of crashes from another. Prior works focus on discretization to apply contrast mining to continuous data. We propose the first direct application of contrast learning to continuous data, without the need for discretization. We also define a weighted anomaly score that unifies continuous and categorical contrast sets while mitigating bias, as well as uncertainty measures that communicate confidence to developers. We demonstrate the value of our novel statistical improvements by applying it on a challenging dataset from Facebook production logs, where we achieve 40x speedup over baseline approaches using discretization.
... The CPPD-based feature selection model is a combination of Probability Proportional Difference (PPD) and Categorical Proportional Difference (CPD). The work in [22] presents the CPD model, which calculates the degree to which a word contributes to discerning the class; the top contributing words are chosen for the classification. The PPD method, on the other hand, calculates the probability that a word belongs to a specific class, and the words or terms with a greater degree of belongingness are taken into consideration for classification. ...
... CPD [22] calculates the degree to which a feature distinguishes a specific category from other categories. The attainable values for CPD are limited to the interval (-1, 1). ...
Article
Full-text available
The dimensionality of the feature space exhibits a significant effect on the processing time and predictive performance of Malware Detection Systems (MDS). Therefore, the selection of relevant features is crucial for the classification process. Feature Selection Technique (FST) is a prominent solution that effectively reduces the dimensionality of the feature space by identifying and neglecting noisy or irrelevant features from the original feature space. The significant features recommended by FST uplift the malware detection rate. This paper provides a performance analysis of four chosen filter-based FSTs and their impact on the classifier decision. FSTs such as Distinguishing Feature Selector (DFS), Mutual Information (MI), Categorical Proportional Difference (CPD), and Darmstadt Indexing Approach (DIA) have been used in this work, and their efficiency has been evaluated using different datasets, various feature lengths, classifiers, and success measures. The experimental results explicitly indicate that DFS and MI offer competitive performance in terms of better detection accuracy and that the efficiency of the classifiers does not decline on either the balanced or the unbalanced datasets.
... Categorical Proportional Difference (CPD): In order to determine the impact of each feature (unigram) in representing a class, the CPD method was proposed in [104]. The frequency of each feature in each class (positive or negative) is calculated separately; polarized words, which occur predominantly in one class, have a higher PD value, while words distributed equally across both classes have a lower PD value. ...
Article
Full-text available
Natural language processing (NLP) techniques can prove relevant to a variety of specialties in the field of cognitive science, including sentiment analysis. This paper investigates the impact of NLP tools, various sentiment features, and sentiment lexicon generation approaches on sentiment polarity classification of internet reviews written in the Persian language. For this purpose, a comprehensive Persian WordNet (FerdowsNet), with high recall and proper precision (based on Princeton WordNet), was developed. Using FerdowsNet and a generated corpus of reviews, a Persian sentiment lexicon was developed using (i) mapping to SentiWordNet and (ii) a semi-supervised learning method, after which the results of both methods were compared. In addition to sentiment words, a set of various features was extracted and applied to sentiment classification. Then, by employing various well-known feature selection approaches and state-of-the-art machine learning methods, sentiment classification of Persian text reviews was carried out. The obtained results demonstrate the critical role of sentiment lexicon quality in improving the quality of sentiment classification in the Persian language.
... The CPD [8] value of a term is computed as the ratio of the difference between the number of documents of one category in which it appears and the number of documents of the other category in which it appears, to the total number of documents in which the term appears. The CPD value for a feature can be calculated using equation (1), where posD is the number of positive review documents in which the term appears and negD is the number of negative review documents in which the term appears. ...
... 11. Categorical Proportional Difference (cpd): cpd is a ratio that considers the number of documents of a category in which the feature occurs and the number of documents from other categories in which the feature also occurs (Simeon and Hilderman, 2008). 12. Multinomial Z Score (zd): zd assumes that a feature follows a binomial distribution and calculates a Z transformation for the feature in each class. It boosts features that are highly unevenly distributed among the classes, giving a high positive score to a feature in the class where it is highly frequent and a negative score in the class where it rarely appears (Hamdan et al., 2014; Savoy, 2012). ...
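One common reading of the zd description above treats a term's frequency in a class as a binomial draw whose success probability is estimated from the whole corpus. The sketch below follows that assumption and may differ in detail from the exact formulation used by Hamdan et al. (2014) and Savoy (2012).

```python
import math

def z_score(tf_in_class, n_class, tf_total, n_total):
    """Binomial Z transformation for a term in one class.

    tf_in_class : term occurrences (or documents containing the term) in the class
    n_class     : total tokens (or documents) in the class
    tf_total    : term occurrences in the whole corpus
    n_total     : total tokens in the whole corpus
    The expected count under the corpus-wide probability p is n_class * p;
    the score is positive where the term is over-represented and negative
    where it is under-represented.
    """
    p = tf_total / n_total
    expected = n_class * p
    std = math.sqrt(n_class * p * (1 - p))
    return 0.0 if std == 0 else (tf_in_class - expected) / std

# a term seen 60 times in a 1,000-token class but only 100 times in a 10,000-token corpus
print(round(z_score(60, 1_000, 100, 10_000), 2))   # ~15.89, strongly over-represented
```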
Article
Full-text available
Term weighting metrics assign weights to terms in order to discriminate the important terms from the less crucial ones. Due to this characteristic, these metrics have attracted growing attention in text classification and, recently, in sentiment analysis. Using the weights given by such metrics can lead to a more accurate document representation, which may improve the performance of the classification. While previous studies have focused on proposing or comparing different weighting metrics for two-class, document-level sentiment analysis, this study proposes to analyse the results given by each metric in order to find out the characteristics of good and bad weighting metrics. Therefore, we present an empirical study of fifteen global supervised weighting metrics with four local weighting metrics adopted from information retrieval. We also analyse the behavior of each metric by observing how it distributes the terms, and deduce some characteristics which may distinguish the good metrics from the bad ones. The evaluation has been done using Support Vector Machines on three different datasets: Twitter, restaurant and laptop reviews.
... Hence, we compile a list of these words and exclude them when calculating tweet polarities. We detect such words with the help of categorical proportional difference (CPD) (Simeon and Hilderman, 2008), which describes how much a word w contributes to distinguishing the different classes. It is calculated as CPD_w = |A − B| / (A + B), where A corresponds to the occurrences of the word w.r.t. ...
Conference Paper
Full-text available
This paper describes our approach to SemEval 2016 Task 4, "Sentiment Analysis in Twitter", where we participated in subtask A. Our system relies on AlchemyAPI and SentiWordNet to create 43 features, based on which we select a feature subset as the final representation. Active Learning then filters out noisy tweets from the provided training set, leaving a smaller set of only 900 tweets, which we use to train a Multinomial Naive Bayes classifier to predict the labels of the test set with an F1 score of 0.478.
... cpd is a ratio that considers the number of documents of a category in which the feature occurs and the number of documents from other categories in which the feature also occurs (Simeon and Hilderman, 2008). ...
Thesis
Full-text available
In this thesis, we address the problem of sentiment analysis. More specifically, we are interested in analyzing the sentiment expressed in social media texts such as tweets or customer reviews about restaurants, laptops and hotels, or scholarly book reviews written by experts. We focus on two main tasks: sentiment polarity detection, in which we aim to determine the polarity (positive, negative or neutral) of a given text, and opinion target extraction, in which we aim to extract the opinion targets towards which people tend to express their opinions (e.g. food, pizza and service are opinion targets in restaurant reviews). Our main objective is constructing state-of-the-art systems that can perform the two tasks. Therefore, for evaluation purposes, we participated in the International Workshop on Semantic Evaluation (SemEval), choosing two tasks: (1) Sentiment analysis in Twitter, in which we seek to determine the polarity of a tweet, and (2) Aspect-Based sentiment analysis, which aims to extract the opinion targets in restaurant reviews and then determine the polarity of each target. We have also applied and evaluated our methods using a French book review corpus constructed by the OpenEdition team, in which we also extract the opinion targets and their polarities. Our proposed methods are supervised for both tasks: 1. For sentiment polarity detection, we address three points: term weighting, feature extraction and the classification method. We first study several supervised term weighting metrics and analyze the behavior of term weighting metrics that could give good performance. Then, we enrich the document representation by extracting several groups of features. As the features extracted from sentiment lexicons seem to be the most influential, we propose a new metric called natural entropy to construct an automatic sentiment lexicon from a noisily labeled Twitter corpus, and we combine the features extracted from this lexicon to improve the performance. The evaluation demonstrates that this rich feature extraction process can produce a state-of-the-art system in sentiment analysis. After these experiments with term weighting and feature extraction with classic classification methods such as Support Vector Machines and Logistic Regression, we found that it is difficult to understand the decisions of those classic methods. Therefore, we propose a simple and interpretable model for estimating the polarity of text. This new model relies on a bottom-up approach, going from word polarity to text polarity detection. Our first experiments show that this new model seems promising and could outperform the classic methods. 2. For opinion target extraction, we adopt a Conditional Random Field model with a feature extraction process; most of the extracted features have proved their performance in the entity extraction problem. We applied this model to extract the opinion targets in English restaurant reviews and French scholarly book reviews.
... CPD [32,22] measures how well a term can be used to differentiate between different categories, based on the ratio of the word's occurrence frequency in the different categories of documents, as shown below, where A is the number of times the word and the category co-occur and B is the number of times the word occurs without the category. ...
Article
Full-text available
Text documents are normally represented as a feature-document matrix in sentiment analysis. Features can be single words from the text document or more complex pairs extracted by different schemes that add information in order to enrich the feature-document matrix representation. Having diverse feature types, however, creates a problem of high dimensionality due to the vast number of features and the relations they hold. Thus, feature selection helps ensure that effective and efficient sentiment analysis applications can be developed, by selecting features that are relevant and informative to help classifiers perform better and by narrowing down the feature set to reduce the processing load. This paper highlights the methods used for feature selection, namely filter, wrapper and embedded methods. Prior to feature selection, preprocessing techniques are performed to first reduce the number of features. The paper concludes by summarizing this review, outlining the challenges faced, and proposing an ensemble feature selection method for sentiment analysis data.
... CPD has shown good performance [32,24]; however, CPD takes only the occurrence of a feature into consideration, whereas other feature selection methods also consider statistics of the feature's absence. ...
Article
Full-text available
Sentiment analysis is an important task for the automated classification of positive and negative opinions by a machine. The approaches to this task can be either rule-based or machine learning, the latter being the current trend due to its automaticity and versatility. An ensemble framework based on machine learning classifiers diversifies the text data and the learning process in order to produce accurate sentiment classification. In this review paper, we highlight the various parts to consider when creating an ensemble of machine learning classifiers for sentiment analysis. From the selection of suitable text features to the training of the machine learner, they all influence the accuracy of the system. The ensemble framework improves the accuracy of the system based on the concept that stronger classifiers compensate for the performance of weak classifiers. The paper concludes with the inclusion of a diversified feature selection method in the ensemble framework to select useful features and reduce the size of the feature set.
... Other methods compute a score for each individual feature and then pick out a feature set of predefined size according to the rank of the scores, such as the Chi-square statistic, mutual information, information gain and so on [22,23,25,30,31]. Zheng et al. [26] divided these feature selection methods into one-sided (e.g. ...
Article
Full-text available
Bloggers' opinions related to commercial products/services might have a significant influence on consumers' purchasing decisions. Some negative comments could reduce consumers' purchase intentions and bring great damage to enterprises. However, the comments in blogs are often unstructured, subjective, and hard to comprehend in a short time. In some cases, the negative comments are fewer than the positive opinions, but these fewer negative comments spread very fast and are very harmful. According to a consumer reviews and research online report, 62% of online customers will change their mind about buying a product or service after reading 1-3 negative reviews. However, when dealing with such imbalanced sentiment data, researchers have not considered the class imbalance problem. A classifier induced from an imbalanced data set has high classification accuracy for the majority class, but an unacceptable error rate for the minority class. Therefore, identifying consumers' negative sentiments effectively from a large number of online comments has become a serious issue. This study aims to identify the key factors of imbalanced sentiment classification by using the Taguchi method. Then, according to the discovered key factors, we propose a new feature selection method to improve the performance of imbalanced sentiment classification. Moreover, support vector machines (SVM) are employed to construct classifiers for identifying bloggers' negative sentiments. Finally, a case study from real-world blogs is provided to illustrate the effectiveness of our proposed approach.
... It is a measurement that takes into consideration the existence of a feature in a class in its calculation. It was introduced by [6] and used by several other researchers, such as [7] and [8]. The calculation used in [7] was adopted in this study. ...
Article
Full-text available
Opinions about a particular product, service or person are communicated effectively through online media such as Facebook, MySpace and Twitter. Unfortunately, only a few researchers have researched the performance of opinion mining using online messages written in the Malay language. Opinion mining that uses a Natural Language Processing approach is difficult due to the high proportion of noisy text in online messages. On the other hand, opinion mining that uses a machine learning approach requires a good feature selection technique, since current filter-type feature selection techniques require intervention from the user to select the appropriate features. This study used a feature selection technique based on an artificial immune system to select the appropriate features for opinion mining. Experiments with 2000 online movie reviews illustrated that the technique reduced the features by 90% and improved opinion mining accuracy by up to 15% with the k-Nearest Neighbor classifier and up to 6% with the Naïve Bayes classifier.
... The text categorization problem suffers from high dimensionality and sparsity of the text data. It is very important to find the discriminating terms for each category and to reduce the size of vocabulary by removing the irrelevant terms for effective text categorization [2,3]. In text data each unique term is considered as a feature. ...
Article
Full-text available
Term selection methods in text categorization effectively reduce the size of the vocabulary to improve the quality of the classifier. Each corpus generally contains many irrelevant and noisy terms, which eventually reduce the effectiveness of text categorization. Term selection, thus, focuses on identifying the relevant terms for each category without affecting the quality of text categorization. A new supervised term selection approach has been proposed for dimensionality reduction. The method assigns a score to each term of a corpus based on its similarity with all the categories, and then all the terms are ranked accordingly. Subsequently, the significant terms of each category are selected to create the final subset of terms, irrespective of the size of the category. The performance of the proposed term selection technique is compared with that of nine other term selection methods for the categorization of several well-known text corpora using kNN and SVM classifiers. The empirical results show that the proposed method performs significantly better than the other methods in most cases across all the corpora.
... The label information in the source domain will be transferred to the target domain through the two matrices. p_i^l, p_i^SS and p_i^TS (called p in common) are used to represent the polarity of feature i. MI (Mutual Information), CPD (Categorical Proportional Difference) and OR (Odds Ratio) have been used to represent the polarities of features in some efforts [21,23,25]. Considering the range of polarity, here we use OR to represent the polarity. ...
... Categorical proportional distance [17] of a feature t in class C_k is defined as,
Article
Full-text available
Detection of metamorphic malware is a challenging problem as a result of the high diversity in internal code structure between generations. Code morphing/obfuscation, when applied, reshapes malware code without compromising its maliciousness. As a result, signature-based scanners fail to detect metamorphic malware. Prior research in the domain of metamorphic malware detection utilizes similarity matching techniques. This work focuses on the development of a statistical scanner for metamorphic virus detection by employing feature ranking methods such as Term Frequency-Inverse Document Frequency (TF-IDF), Term Frequency-Inverse Document Frequency-Class Frequency (TF-IDF-CF), Categorical Proportional Distance (CPD), Galavotti-Sebastiani-Simi Coefficient (GSS), Weight of Evidence of Text (WET), Term Significance (TS), Odds Ratio (OR), Weighted Odds Ratio (WOR), Multi-Class Odds Ratio (MOR), Comprehensive Measurement Feature Selection (CMFS) and Accuracy2 (ACC2). Malware and benign models for classification are developed by considering the top-ranked features obtained using the individual feature selection methods. The proposed statistical detector detects the Metamorphic worm (MWORM) and viruses generated using the Next Generation Virus Construction Kit (NGVCK) with 100% accuracy and precision. Further, the relevance of the feature ranking methods at varying feature lengths is determined using the McNemar test. Thus, the designed non-signature-based scanner can detect sophisticated metamorphic malware and can be used to support current antivirus products.
... Categorical Proportional Difference [12] of a feature F in class C_k is obtained as,
Conference Paper
Full-text available
Metamorphic malware modifies the code of every new offspring by using code obfuscation techniques. Recent research has shown that metamorphic writers make use of benign dead code to thwart signature and Hidden Markov based detectors. Failure in the detection is due to the fact that the malware code appears statistically similar to benign programs. In order to detect complex malware generated with the hacker-generated tool NGVCK, known to the research community, and the intricate metamorphic worm available as benchmark data, we propose a novel approach using Linear Discriminant Analysis (LDA) to rank and synthesize the most prominent opcode bi-gram features for identifying unseen malware and benign samples. Our investigation resulted in 99.7% accuracy, which reveals that the current method could be employed to improve the detection rate of existing publicly available malware scanners.
... In terms of opinion mining, selecting features that are relevant to positive and negative sentiments is also important. Therefore, in this study, each feature was given a value based on a formula introduced by Simeon and Hilderman [25], named Categorical Proportional Difference (CPD). O'Keefe [26] adjusted the formula to fit the two-class case, as shown in Equation 1. ...
Article
Full-text available
The number of messages that can be mined from online entries increases as the number of online application users increases. In Malaysia, online messages are written in mixed languages known as ‘Bahasa Rojak’. Therefore, mining opinions using natural language processing activities is difficult. This study introduces a Malay Mixed Text Normalization Approach (MyTNA) and a feature selection technique based on an Immune Network System (FS-INS) for the opinion mining process using a machine learning approach. The purpose of MyTNA is to normalize noisy texts in online messages. In addition, FS-INS automatically selects relevant features for the opinion mining process. Several experiments involving 1000 positive movie reviews and 1000 negative movie reviews have been conducted. The results show that the accuracy of opinion mining using Naïve Bayes (NB), k-Nearest Neighbor (kNN) and Sequential Minimal Optimization (SMO) increases after the introduction of MyTNA and FS-INS.
Article
Full-text available
When unauthorized copying or stealing of the intellectual property of others happens, it is called plagiarism. Two main approaches are used to counter this problem: external plagiarism detection and intrinsic plagiarism detection. External algorithms compare a suspicious file with numerous sources, whereas intrinsic algorithms are allowed to inspect only the suspicious file in order to predict plagiarism. In this work, the area chosen for detecting plagiarism is programs, or source code files. The stealing that happens in the case of source code is copying the entire source code or the logic used in a particular program without permission or copyright. There exist many ways to detect plagiarism in source code files. Performing plagiarism checking for a large dataset has a very high computational cost and is a time-consuming job. To achieve computationally efficient similarity detection in source code files, the Hadoop framework is used, where parallel computation is possible for large datasets. But the raw data available to us is not in a suitable form for the existing plagiarism checking tools to work with, as its size is too large and it possesses the features of big data. Thus a qualifying model is required for the dataset to be fed into Hadoop so that it can be efficiently processed to check for plagiarism in source code. To generate such a model, machine learning is used, which incorporates big data with machine learning.
Preprint
Sentiment analysis is a domain of study that focuses on identifying and classifying the ideas expressed in text into positive, negative and neutral polarities. Feature selection is a crucial process in machine learning. In this paper, we aim to study the performance of different feature selection techniques for sentiment analysis. Term Frequency-Inverse Document Frequency (TF-IDF) is used as the feature extraction technique for creating the feature vocabulary. Various Feature Selection (FS) techniques are experimented with to select the best set of features from the feature vocabulary. The selected features are trained using different machine learning classifiers: Logistic Regression (LR), Support Vector Machines (SVM), Decision Tree (DT) and Naive Bayes (NB). The ensemble techniques Bagging and Random Subspace are applied to the classifiers to enhance performance on sentiment analysis. We show that the best FS techniques, when trained using ensemble methods, achieve remarkable results on sentiment analysis. We also compare the performance of FS methods trained using Bagging and Random Subspace with varied neural network architectures. We show that FS techniques trained using ensemble classifiers outperform neural networks while requiring significantly less training time and fewer parameters, thereby eliminating the need for extensive hyper-parameter tuning.
Preprint
Document clustering is a text mining technique used to provide better document search and browsing in digital libraries or online corpora. A lot of research has been done on biomedical document clustering based on existing ontologies. However, associations and co-occurrences of medical concepts are not well represented by using an ontology. In this research, a vector representation of disease concepts and a similarity measurement between concepts are proposed. They identify the closest disease concepts in the context of a corpus. Each document is represented using the vector space model. A weighting scheme is proposed to consider both local content and associations between concepts. A Self-Organizing Map (SOM) is used as the document clustering algorithm. The vector projection and visualization features of the SOM enable visualization and analysis of the cluster distributions and relationships in two-dimensional space. The experimental results show that the proposed document clustering framework generates meaningful clusters and facilitates visualization of the clusters based on the disease concepts.
Article
Full-text available
Sambat Online is a facility for collecting suggestions, criticism, complaints, and questions from the citizens of Malang about the Malang City Government, submitted through a dedicated website or by short message to a designated number. Each incoming complaint text is categorized into the area of the SKPD (regional work unit) responsible for it. To make complaint texts easier to organize and to improve the efficiency of administrators in sorting complaints and determining the target SKPD area, an intelligent system is needed that classifies documents according to their destination. K-Nearest Neighbor (K-NN) is a classification method that searches for the documents closest to a given document. The feature selection method used is Categorical Proportional Difference (CPD), which measures the degree of contribution of a word. The process consists of collecting training and test documents, preprocessing and feature selection, term weighting, classification, and finally testing and analysing the system's classification results in terms of accuracy, precision, recall, and F-measure. The most optimal performance was obtained with k = 1 and 100% of the features, yielding an accuracy of 91.84%; this is better than the accuracy obtained with feature selection, in which terms with low CPD values are removed.
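A minimal sketch of how the CPD score used in this study can be computed, assuming the common formulation CPD(w, c) = (A - B) / (A + B), where A is the number of documents of category c containing word w and B is the number of documents of other categories containing w. The toy complaint tokens and category labels below are invented; this is not the authors' implementation.

```python
# Illustrative CPD scoring over a tokenized corpus (a sketch, not the paper's code).
from collections import defaultdict

def cpd_scores(docs, labels):
    """Return {word: max CPD over categories}, using document counts per category."""
    in_cat = defaultdict(lambda: defaultdict(int))  # word -> category -> doc count
    total = defaultdict(int)                        # word -> doc count over all categories
    for tokens, cat in zip(docs, labels):
        for w in set(tokens):                       # document frequency, not term frequency
            in_cat[w][cat] += 1
            total[w] += 1
    scores = {}
    for w, per_cat in in_cat.items():
        best = -1.0
        for cat, a in per_cat.items():
            b = total[w] - a                        # occurrences outside the category
            best = max(best, (a - b) / (a + b))
        scores[w] = best
    return scores

docs = [["service", "slow", "complaint"], ["road", "damaged", "complaint"],
        ["service", "fast", "praise"]]
labels = ["public_service", "infrastructure", "public_service"]
print(sorted(cpd_scores(docs, labels).items(), key=lambda kv: -kv[1]))
```

Terms whose best CPD score falls below a chosen threshold could then be dropped before the K-NN classification step described above.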
Article
The development of online virtual communities has raised the importance of analyzing the massive volumes of text found on websites and social networks. This research analyzed financial blogs and online news articles to develop a public mood dynamic prediction model for stock markets, drawing on the perspectives of behavioral finance and the characteristics of online financial communities. The research applies big data and opinion mining approaches to investor sentiment analysis in Taiwan. The proposed model was verified using experimental datasets from ChinaTimes.com, cnYES.com, Yahoo stock market news, and Google stock market news over an 18 month period. Empirical results indicate that big data analysis techniques that assess the emotional content of commentary on current stock and financial issues can effectively forecast stock price movements.
Chapter
Sentiment analysis research has been increasing tremendously in recent times due to its wide range of business and social applications. Sentiment analysis from unstructured natural language text has recently received considerable attention from the research community. In this chapter, we propose a novel sentiment analysis model based on commonsense knowledge extracted from a ConceptNet-based ontology and on context information. The ConceptNet-based ontology is used to determine domain-specific concepts, which in turn yield the important domain-specific features. The polarities of the extracted concepts are then determined using a contextual polarity lexicon which we developed by considering the context information of each word. Finally, the semantic orientations of the domain-specific features of a review document are aggregated based on the importance of each feature with respect to the domain, where the importance of a feature is determined by its depth in the ontology. Experimental results show the effectiveness of the proposed methods.
Chapter
The field of sentiment analysis is an exciting new research direction due to the large number of real-world applications in which discovering people's opinions is important for better decision-making. The development of techniques for document-level sentiment analysis is one of the significant components of this area. Recently, people have started expressing their opinions on the Web, which has increased the need to analyze opinionated online content for various real-world applications. A lot of research on detecting sentiment from text is present in the literature, yet there remains considerable scope for improving existing sentiment analysis models, in particular by incorporating more semantic and commonsense knowledge.
Chapter
Sentiment analysis from unstructured natural language text has recently received considerable attention from the research community. In the context of biologically inspired machine learning approaches, finding good feature sets is particularly challenging yet very important. In this chapter, we focus on this fundamental issue of the sentiment analysis task. Specifically, we employ concepts as features and present a concept extraction algorithm that extracts semantic features exploiting the semantic relationships between words in natural language text. Additional conceptual information for a concept is obtained using the ConceptNet ontology: concepts extracted from text are sent as queries to ConceptNet to retrieve their semantics. Further, we select important concepts and eliminate redundant ones using the Minimum Redundancy and Maximum Relevance feature selection technique. All selected concepts are then used to build a machine learning model that classifies a given document as positive or negative.
Chapter
Two types of techniques have been used in the literature for the semantic orientation-based approach to sentiment analysis: (i) corpus based and (ii) dictionary, lexicon, or knowledge based. In this chapter, we explore the corpus-based semantic orientation approach. The corpus-based approach requires a large dataset to detect the polarity of terms and hence the sentiment of the text; its main limitation is that polarity can only be computed for terms that appear in the training corpus. The approach has been explored extensively in the literature owing to its simplicity [29, 120]. It first mines sentiment-bearing terms from unstructured text and then computes the polarity of those terms. Most sentiment-bearing terms are multi-word features rather than bag-of-words features, e.g., "good movie," "nice cinematography," "nice actors," etc. The performance of the semantic orientation-based approach has been limited in the literature due to inadequate coverage of such multi-word features.
Chapter
Sentiment analysis research has attracted a large number of researchers around the globe [61, 66, 93, 127]. Sentiment analysis attempts to determine whether a given text is subjective or objective and, further, whether a subjective text contains positive or negative opinion. The techniques employed by sentiment analysis models can be broadly categorized into machine learning [93] and semantic orientation approaches [29, 132]. A lot of research has been done on detecting sentiment from text [67], yet there remains considerable scope for improving existing sentiment analysis models; the performance of existing methods can be further improved by including more semantic information.
Chapter
Opinion Mining, or Sentiment Analysis, is the study of people's opinions or sentiments expressed in text towards entities such as products and services. It has always been important to know what other people think, and with the rapid growth in availability and popularity of online review sites, blogs, forums, and social networking sites, the necessity of analysing and understanding these reviews has arisen. The main approaches to sentiment analysis can be categorized into semantic orientation-based, knowledge-based, and machine learning approaches. This chapter surveys the machine learning approaches applied to sentiment analysis applications, with the main emphasis on research that applies machine learning methods to sentiment classification at the document level. Machine learning-based approaches work in the following phases, which are discussed in detail in this chapter: (1) feature extraction, (2) feature weighting schemes, (3) feature selection, and (4) machine learning methods. The chapter also discusses standard free benchmark datasets and evaluation methods for sentiment analysis, and concludes with a comparative study of some state-of-the-art methods for sentiment analysis and possible future research directions in opinion mining and sentiment analysis.
Article
Full-text available
Filtering of spam emails is a significant operation in email systems. The efficiency of this process is determined by many factors, such as the number of features, the representation of samples, and the classifier. This study covers all these factors and aims to find the optimal settings for email spam filtering. Twelve feature selection methods extensively used in text categorization are implemented to synthesize prominent attributes from different parts of the mail (i.e., header, subject, and body). Optimal classification performance is obtained with the Weighted Mutual Information and Log-TFIDF-Cosine (LTC) feature selection methods for the header and body features of the mail, using Random Forest and Support Vector Machine classifiers respectively. An overall F1-measure of 0.978 with a prediction time of 0.44 s is achieved when 20% of the original feature length is considered.
Conference Paper
To provide a solution for the detection of metamorphic viruses (obfuscated malware), we propose a non-signature-based approach using feature selection techniques such as Categorical Proportional Difference (CPD), Weight of Evidence of Text (WET), Term Frequency-Inverse Document Frequency (TF-IDF), and Term Frequency-Inverse Document Frequency-Class Frequency (TF-IDF-CF). The feature selection methods are employed to rank and prune bi-gram features obtained from malware and benign files, and the synthesized features are further evaluated for their prominence in either class. Using our proposed methodology, 100% accuracy is obtained on the test samples. Hence, we argue that the proposed statistical scanner can identify future metamorphic variants and can assist antivirus software with high accuracy.
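As a loose illustration of representing programs as ranked bi-gram features, the sketch below builds opcode bi-grams with scikit-learn and ranks them with a chi-square score as a stand-in scorer; the paper itself ranks features with CPD, WET, TF-IDF, and TF-IDF-CF. The opcode traces and labels shown are invented.

```python
# Sketch: bag-of-bi-grams over opcode mnemonics, ranked by a stand-in score.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

traces = ["mov push call pop ret", "push push call add ret",
          "mov add sub mov ret", "mov mov add ret"]
labels = [1, 1, 0, 0]                       # 1 = malware, 0 = benign (toy labels)

vec = CountVectorizer(ngram_range=(2, 2))   # bi-grams of opcode mnemonics
X = vec.fit_transform(traces)
scores, _ = chi2(X, labels)                 # stand-in for CPD / WET / TF-IDF-CF ranking

ranked = sorted(zip(vec.get_feature_names_out(), scores), key=lambda t: -t[1])
print(ranked[:5])                           # top-ranked bi-gram features
```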
Article
Text classification is an important part of information retrieval and text mining. The high dimensionality of the feature space weakens the ability of feature items to distinguish between categories. In this paper, we introduce an improved CPD feature selection method to reduce the dimension of the feature space. CPD does not consider the importance of items within a document or the relevance between items. We define frequency, dispersion, concentration, and feature redundancy, and use a minimum-frequency threshold and the mutual information between items to improve the ability of items to distinguish categories, to remove redundant feature items, and to reduce computational complexity, ultimately increasing the precision and recall of classification. A Bayes classifier is used for text classification, and the F value is used as the evaluation index. The experimental results show that the improved CPD is superior to CPD and to other feature selection methods.
Article
The rapid growth of online social media acts as a medium through which people contribute their opinions and emotions as text messages, including reviews and opinions on topics such as movies, books, products, and politics. Opinion mining refers to the application of natural language processing, computational linguistics, and text mining to identify or classify whether the opinion expressed in a text message is positive or negative. Back-propagation neural networks are supervised machine learning methods that analyze data and recognize patterns for classification. This work focuses on binary classification of text sentiment into positive and negative reviews. In this study, Principal Component Analysis (PCA) is used to extract the principal components to be used as predictors, and a back-propagation neural network (BPN) is employed as the classifier. The performance of PCA + BPN and of BPN without PCA is compared using Receiver Operating Characteristic (ROC) analysis, and the classifier is validated using 10-fold cross-validation. The results show the effectiveness of BPN with PCA as a feature reduction method for text sentiment classification.
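A minimal sketch of the PCA plus back-propagation idea, assuming a scikit-learn pipeline in which TF-IDF vectors are densified, projected onto principal components, and classified by a multilayer perceptron trained with back-propagation. The reviews, labels, and hyper-parameters are invented; this is not the paper's implementation.

```python
# Sketch: TF-IDF -> PCA predictors -> back-propagation network (MLP) classifier.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

reviews = ["excellent product, works perfectly", "broke after one day, terrible",
           "very happy with this purchase", "awful quality, do not buy"]
labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative review

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("dense", FunctionTransformer(lambda X: X.toarray())),  # PCA needs a dense matrix
    ("pca", PCA(n_components=2)),                           # principal-component predictors
    ("bpn", MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)),
])
model.fit(reviews, labels)
print(model.predict(["terrible quality", "works perfectly, very happy"]))
```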
Article
It is a well-known fact that neuropsychiatric disorders cause abnormalities in the connectivity patterns of brain regions. Identifying and characterising these abnormalities can be exploited to improve the diagnosis of neuropsychiatric diseases with the help of resting-state functional magnetic resonance imaging (rfMRI) data. However, this is not an easy task because rfMRI produces very high-dimensional data, which leads to the curse of dimensionality. It is therefore necessary to reduce the number of features in order to obtain better classification accuracy, which requires a robust feature selection criterion that best describes the differences between epileptic patients and the healthy control group. In this paper we present a classification model in which we introduce a voting-based feature selection (VFS) approach that ensures the selection of the most discriminative features by combining the capabilities of several feature selection techniques. We use AdaBoost with an RBF network as the classifier to avoid overfitting, and apply this model to rfMRI-based data to discriminate between the two groups. We correctly classify epileptic patients versus healthy controls with 85.33% classification accuracy on a heterogeneous dataset using the proposed model. To the best of our knowledge, the results presented in this paper are better than other results reported in the current literature on this dataset, confirming the effectiveness of our classification model.
Article
Our research developed a non-signature-based approach employing feature selection methods such as Categorical Proportional Difference (CPD), Weight of Evidence of Text (WET), Term Frequency-Inverse Document Frequency (TF-IDF), Term Frequency-Inverse Document Frequency-Class Frequency (TF-IDF-CF), the Galavotti-Sebastiani-Simi coefficient (GSS), and Term Significance (TS). A classification model is developed by considering bi-gram features ranked with these feature selection techniques. The proposed feature selection approaches detect unseen malware samples with accuracy in the range of 99% to 100%. The relevance of the feature ranking methods at varying feature lengths is ascertained using McNemar's test.
Article
Full-text available
Machine learning for text classification is the cornerstone of document categorization, news filtering, document routing, and personalization. In text domains, effective feature selection is essential to make the learning task efficient and more accurate. This paper presents an empirical comparison of twelve feature selection methods (e.g., Information Gain) evaluated on a benchmark of 229 text classification problem instances that were gathered from Reuters, TREC, OHSUMED, etc. The results are analyzed from multiple goal perspectives (accuracy, F-measure, precision, and recall), since each is appropriate in different situations. The results reveal that a new feature selection metric we call 'Bi-Normal Separation' (BNS) outperformed the others by a substantial margin in most situations. This margin widened in tasks with high class skew, which is rampant in text classification problems and is particularly challenging for induction algorithms. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner faced with a single dataset who seeks to choose one (or a pair of) metrics that are most likely to yield the best performance. From this perspective, BNS was the top single choice for all goals except precision, for which Information Gain yielded the best result most often. This analysis also revealed, for example, that Information Gain and Chi-Squared have correlated failures, and so they work poorly together. When choosing optimal pairs of metrics for each of the four performance goals, BNS is consistently a member of the pair; e.g., for greatest recall, the pair BNS + F1-measure yielded the best performance on the greatest number of tasks by a considerable margin.
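For reference, the Bi-Normal Separation score described here is usually computed as BNS(w) = |F^-1(tpr) - F^-1(fpr)|, where F^-1 is the inverse of the standard normal cumulative distribution, tpr = P(word | positive class), and fpr = P(word | negative class). The sketch below is a rough rendering of that formula; the clipping constant (shown as 0.0005 to keep the inverse CDF finite) and the document counts are illustrative rather than taken from the paper.

```python
# Sketch of the Bi-Normal Separation (BNS) feature-scoring metric.
from scipy.stats import norm

def bns(tp, fp, pos, neg, eps=0.0005):
    """tp/fp: positive/negative documents containing the word; pos/neg: class sizes."""
    tpr = min(max(tp / pos, eps), 1 - eps)   # clip away from 0 and 1
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))

# A word in 80 of 100 positive documents but only 5 of 100 negative documents
# scores much higher than one spread evenly across both classes.
print(bns(tp=80, fp=5, pos=100, neg=100))    # strongly discriminative
print(bns(tp=50, fp=48, pos=100, neg=100))   # barely discriminative
```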
Article
Full-text available
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
Conference Paper
Full-text available
We tackle two different problems of text categorization (TC), namely feature selection and classifier induction. Feature selection (FS) refers to the activity of selecting, from the set of r distinct features (i.e., words) occurring in the collection, the subset of r′ ≪ r features that are most useful for compactly representing the meaning of the documents. We propose a novel FS technique based on a simplified variant of the χ2 statistic. Classifier induction refers instead to the problem of automatically building a text classifier by learning from a set of documents pre-classified under the categories of interest. We propose a novel variant, based on the exploitation of negative evidence, of the well-known k-NN method. We report the results of systematic experimentation with these two methods performed on the standard REUTERS-21578 benchmark.
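The simplified chi-square variant referred to here is commonly written as s-chi2(t, c) = P(t, c) * P(not t, not c) - P(t, not c) * P(not t, c). The short sketch below computes it from document counts under that assumption; the counts are invented and purely illustrative.

```python
# Sketch of the simplified chi-square (GSS-style) term score from document counts.
def simplified_chi2(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    """n_tc: docs in c containing t; n_t_notc: docs not in c containing t; etc."""
    n = n_tc + n_t_notc + n_nott_c + n_nott_notc
    p_tc, p_t_notc = n_tc / n, n_t_notc / n
    p_nott_c, p_nott_notc = n_nott_c / n, n_nott_notc / n
    return p_tc * p_nott_notc - p_t_notc * p_nott_c

# A term concentrated in category c scores well above one spread evenly.
print(simplified_chi2(40, 5, 10, 45))   # concentrated in c
print(simplified_chi2(25, 25, 25, 25))  # independent of c -> 0.0
```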
Article
Full-text available
While naive Bayes is quite effective in various data mining tasks, it shows disappointing results in the automatic text classification problem. Based on an examination of naive Bayes applied to natural language text, we found a serious problem in the parameter estimation process which causes poor results in the text classification domain. In this paper, we propose two empirical heuristics: per-document text normalization and a feature weighting method. While these are somewhat ad hoc methods, our proposed naive Bayes text classifier performs very well on the standard benchmark collections, competing with state-of-the-art text classifiers based on highly complex learning methods such as SVM.
Article
Categorization of documents is challenging, as the number of discriminating words can be very large. We present a nearest neighbor classification scheme for text categorization in which the importance of discriminating words is learned using mutual information and weight adjustment techniques. The nearest neighbors for a particular document are then computed based on the matching words and their weights. We evaluate our scheme on both synthetic and real world documents. Our experiments with synthetic data sets show that this scheme is robust under different emulated conditions. Empirical results on real world documents demonstrate that this scheme outperforms state of the art classification algorithms such as C4.5, RIPPER, Rainbow, and PEBLS.
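A rough sketch (not the authors' system) of the weighted-matching idea this abstract describes: each word receives a mutual-information weight with respect to the class label, and the similarity between a query document and a training document is the summed weight of their shared words. The news snippets and labels below are invented.

```python
# Sketch: mutual-information word weights + weighted word overlap for 1-NN.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

train = ["stocks fell sharply today", "team wins championship final",
         "markets rallied after strong earnings", "coach praised star players"]
y = ["finance", "sport", "finance", "sport"]

vec = CountVectorizer(binary=True)
X = vec.fit_transform(train).toarray()
weights = mutual_info_classif(X, y, discrete_features=True, random_state=0)

def predict(doc):
    q = vec.transform([doc]).toarray()[0]
    sims = (X * q) @ weights            # summed weight of words shared with each neighbour
    return y[int(np.argmax(sims))]      # label of the nearest (most similar) document

print(predict("earnings report lifted markets"))  # expected: finance
```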
Article
We propose a set of machine learning (ML)-based scoring measures for conducting feature selection. We have tested these measures on documents from two well-known corpora, comparing them with other measures previously applied for this purpose. In particular, we have analyzed which measure obtains the best overall classification performance in terms of properties such as precision and recall, emphasizing to what extent some statistical properties of the corpus affect performance. The results show that some of our measures outperform the traditional measures in certain situations.
Article
In this paper, we describe an automated learning approach to text categorization based on perceptron learning and a new feature selection metric called the correlation coefficient. Our approach has been tested on the standard Reuters text categorization collection. Empirical results indicate that our approach outperforms the best published results on this Reuters collection; in particular, our new feature selection method yields considerable improvement. We also investigate the usability of our automated learning approach by actually developing a system that categorizes texts into a tree of categories. We compare the accuracy of our learning approach to a rule-based, expert system approach that uses a text categorization shell built by the Carnegie Group. Although our automated learning approach still gives lower accuracy, by appropriately incorporating a set of manually chosen words to use as features, the combined, semi-automated approach yields accuracy close to the rule-based approach. ...
Article
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ2-test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98% of the unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed, we found strong correlations between the DF, IG, and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest computational cost, can be reliably used instead of IG or CHI when the computation of those measures is too expensive. TS compares favorably with the other methods with up to 5...
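For reference, the chi-square (CHI) term-goodness statistic compared in this study is conventionally computed from a two-by-two term/category contingency table. The sketch below follows that standard formulation; the document counts are invented.

```python
# Sketch of the chi-square term-goodness statistic for a term t and category c:
# chi2(t, c) = N * (A*D - C*B)^2 / ((A + C) * (B + D) * (A + B) * (C + D)),
# where A = docs in c with t, B = docs not in c with t,
#       C = docs in c without t, D = docs not in c without t.
def chi_square(a, b, c, d):
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

print(chi_square(a=40, b=10, c=20, d=130))  # term concentrated in the category
print(chi_square(a=20, b=20, c=80, d=80))   # term independent of the category -> 0.0
```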