Comparison between accuracy, precision, recall, and F1 score of all learning models using the TF-IDF technique.
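
The four metrics named in the caption can all be derived from counts of true and false positives and negatives. The following is a minimal sketch, assuming binary labels; the function name `binary_metrics` and the sample labels are illustrative only and are not taken from the source publication.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```

For multi-class problems the same counts are usually computed per class and then averaged; which averaging scheme the source publication used is not stated here.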

Source publication
Article
Full-text available
App stores usually allow users to give reviews and ratings that are used by developers to resolve issues and make plans for their apps. In this way, these app stores collect large amounts of data for analysis. However, there are several challenges that must first be addressed, related to redundancy and the volume of data, by using machine learning....

Similar publications

Article
Full-text available
The emergence of crowdfunding has given many capital demanders a new fund-raising channel, but the overall project success rate is very low. Many scholars have begun to identify the key success factors of crowdfunding projects. Previous studies have used questionnaire surveys to identify important project features. In addition to requiring a lot of...
Article
Full-text available
With the emergence of the big data era, the dimensionality of data has increased exponentially, and handling high-dimensional information has become a difficult task in sectors such as text mining, machine learning, and data analysis. Redundant and inappropriate features increase the complexity of the dimensions, which further results in poor performance....
Conference Paper
Full-text available
In document-level text mining, feature selection is crucial for lowering ambiguity which in turn enhances classifier performance. The selection of the vital features is crucial, especially for the classification of documents in the morphologically rich Indian regional language Kannada. In this regard, the paper proposes Stacked Ensemble Feature Sel...
Article
Full-text available
Text classification (a.k.a. text categorisation) is an effective and efficient technology for information organisation and management. As information resources on the Web and corporate intranets continue to grow explosively, it has become more and more important and has attracted wide attention from many different research fields....
Article
Full-text available
Multi-label classification is the process of specifying more than one class label for each instance. The high-dimensional data in various multi-label classification tasks have a direct impact on reducing the efficiency of traditional multi-label classifiers. To tackle this problem, feature selection is used as an effective approach to retain relevant...

Citations

... (3) Shopify [32]. A well-known e-commerce platform facilitating the creation, operation, and growth of online stores. ...
Article
Full-text available
Web application fingerprint recognition is an effective security technology designed to identify and classify web applications, thereby enhancing the detection of potential threats and attacks. Traditional fingerprint recognition methods, which rely on preannotated feature matching, face inherent limitations due to the ever-evolving nature and diverse landscape of web applications. In response to these challenges, this work proposes an innovative web application fingerprint recognition method founded on clustering techniques. The method involves extensive data collection from the Tranco List, employing adjusted feature selection built upon Wappalyzer and noise reduction through truncated SVD dimensionality reduction. The core of the methodology lies in the application of the unsupervised OPTICS clustering algorithm, eliminating the need for preannotated labels. By transforming web applications into feature vectors and leveraging clustering algorithms, our approach accurately categorizes diverse web applications, providing comprehensive and precise fingerprint recognition. The experimental results, which are obtained on a dataset featuring various web application types, affirm the efficacy of the method, demonstrating its ability to achieve high accuracy and broad coverage. This novel approach not only distinguishes between different web application types effectively but also demonstrates superiority in terms of classification accuracy and coverage, offering a robust solution to the challenges of web application fingerprint recognition.
... This study [19] tests a dataset comprising reviews of Shopify applications. To address the aforementioned constraints, user evaluations are classified into two categories: positive and negative. ...
Article
Full-text available
Nowadays global market products are readily accessible worldwide, and a vast array of reviews across numerous platforms are posted daily in several categories, making it challenging for customers to stay informed about their product interests. To make informed decisions regarding product quality, users require access to reviews and ratings. Owners and managers must analyze customer ratings and the underlying emotional content of reviews to enhance the product’s quality, cost, customer service, and environmental impact. The primary aim of our proposed research is to accurately predict product helpfulness through customer reviews using the Large Language Model (LLM), thereby assisting customers in saving time and money. We employed a benchmark dataset, the Amazon Fine Food Reviews, to develop numerous advanced machine-learning techniques. We introduced a novel transformer approach BERF (BERT Random Forest) for feature engineering to enhance the value of user evaluations for Amazon’s gourmet food products. The BERF method utilizes BERT embeddings and class probability features derived from product helpfulness online reviews textual data. We have balanced the dataset using the Synthetic Minority Over-sampling TEchnique (SMOTE) approach. Our comprehensive study results demonstrated that the Light Gradient Boosting Machine (LGBM) strategy outperformed existing state-of-the-art approaches, achieving an accuracy of 98%. The performance of each method is confirmed using a k-fold method and further improved through hyperparameter optimization. Our innovative study employing a transformer model has significantly enhanced the utility of customer reviews, substantially reducing online product scams and preventing wasted time and money.
... LR is employed to evaluate the probability of class members to confirm the target variable. The logistic function is applied to estimate the probabilities of behavior among independent and dependent variables [37]. The 'solver' variable is set as 'linear' because of linearly separable data. ...
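
The role of the logistic function in the excerpt above can be illustrated with a short sketch: a linear score is squashed into a probability between 0 and 1. The names `sigmoid` and `predict_proba`, and the example weights, are assumptions made for illustration and do not come from the cited study.

```python
import math


def sigmoid(z):
    """Logistic function, written in a numerically stable form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights and features, for illustration only.
p = predict_proba(weights=[1.0, -1.0], bias=0.0, features=[2.0, 2.0])
```

A probability at or above a chosen threshold (commonly 0.5) is then mapped to the positive class.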
Article
Full-text available
Thyroid disease has been on the rise during the past few years. Owing to its importance in metabolism, early detection of thyroid disease is a task of critical importance. Despite several existing works on thyroid disease detection, the problem of class imbalance is not investigated very well. In addition, existing studies predominantly focus on the binary-class problem. This study aims to solve these issues by the proposed approach where ten types of thyroid diseases are considered. The proposed approach uses a differential evolution (DE)-based optimization algorithm to fine-tune the parameters of machine learning models. Moreover, conditional generative adversarial networks are used for data augmentation. Several sets of experiments are carried out to analyze the performance of the proposed approach with and without model optimization. Results suggest that a 0.998 accuracy score can be obtained using AdaBoost with DE optimization which is better than existing state-of-the-art models.
... Following data extraction, the first step is to remove any unwanted or excessive information from the data that does not contribute to the target class prediction. A preprocessing pipeline is used to clean the acquired data for this purpose [25], [26]. The steps that were followed in the sentiment analysis are shown in figure 2 above. ...
Article
Full-text available
Social media platforms and microblogging sites can be used to gather public opinion and sentiment on a range of topics, including the current state of affairs in war-torn countries. During a crisis, Online Social Networks (OSNs) play a critical role in information sharing. The information gathered during such a crisis can reflect public opinion and sentiment on a large scale. Twitter, in particular, contains a significant quantity of geotagged tweets, allowing for sentiment analysis over time and geography. The primary goal of this research study is to harness the power of social media to monitor, examine, and analyze public opinion on the recent "Russia's Invasion on Ukraine", as public opinion is crucial in forming government policy. By delving deeper into social media, one may readily study people's behavior on a variety of subjects and policies, which would be impossible to do using traditional sources. This research paper aims to classify the viewpoint as Positive, Negative, or Neutral by using machine learning techniques (lexicon based) with Natural Language Processing (NLP). The findings of this study can assist various organizations and stockholders in improving their political strategies and commercial decision-making for current and future intents by utilizing social media networks as a valuable source of knowledge.
... Given a spatial-text database D containing many spatial-text objects o, the keywords of all o in D are put into a set to form a document. The weight of each keyword in the document is calculated using the TF-IDF [33] method in the field of natural language processing, which is subsequently used to sort the keywords and construct Huffman trees and keyword encoding. Definition 9. ...
Article
Full-text available
Aiming at the problem that the existing spatial keyword group query problem did not consider the query requirements with exclusion keywords and time attributes, a time-aware group query problem with exclusion keywords (TEGSKQ) is proposed for the first time. To solve this problem effectively, this paper proposes a query method based on the EKTIR-Tree index and dominating group (EKTDG). This method first proposes the EKTIR-tree index, which incorporates Huffman coding and integrates Bloom filters to deal with excluded keywords in order to improve the hit rate of keyword queries, significantly improving the query efficiency and reducing the storage occupancy. Then, the Candidate algorithm is proposed based on the EKTIR-tree index to filter out the spatial–textual objects that meet the query’s keywords and time requirements, narrowing the search space for subsequent queries on a large scale. To address the problem of the low efficiency of existing algorithms based on a spatial distance query, a distance-dominating group is defined and a pruning algorithm based on a spatial distance-dominating group is proposed, which is a refining process of query results and greatly improves the search efficiency of the query. Theoretical and experimental studies show that the proposed method can better handle group queries with exclusion keywords based on time awareness.
... This analysis can be compared with existing systems such as Random Forest [16], Extra Tree Classifier [17], Support Vector Machine [18], Logistic Regression [19], and Decision Tree Classifier [20], as shown in Table 5. [16]: Bleeding, 0.950; ETC [17]: Bleeding, 0.957; SVM [18]: Bleeding, Ulcer, 0.983; LR [19]: Bleeding, 0.976; DTC [20]: Abnormality, 0.894; BIR: Bleeding, 0.993 ...
... In existing research, most studies focus on conducting generalized user satisfaction and demand research through sentiment analysis based on user feedback data obtained from various social media platforms and app stores [12,15,17,28,37]. Timoshenko A and Hauser J R (2019) [5] used machine learning and deep learning methods to prove that machine learning techniques can improve the efficiency of identifying user needs from user feedback corpora. ...
... Guan R, Zhang H, Liang Y, et al. (2020) [14] proposed a new deep learning framework, and the experimental results verified that deep learning models outperform machine learning models in text classification. The data mining work of user feedback data mainly focuses on classification research, including sentiment classification, topic classification, and function classification [15,36], etc. Zhai Y, Song X, Chen Y, et al. [37] and Rustam F, Mehmood A, Ahmad M, et al. [12] attempted to generate various sentiment-related topics using topic models and then conducted further information mining. However, since the data sources in the aforementioned studies are user reviews from app stores and social media platforms, it is difficult to focus and apply these findings to the analysis of specific product quality issues. ...
... Public user feedback datasets [1,15,28,29,36] usually contain user comments from various social media platforms and app stores, which are characterized by emotional and overly broad suggestions, making them unsuitable for in-depth research on specific applications [6,12,15,23,37]. Therefore, it is necessary to design and construct a user feedback dataset specifically for a particular application, including data collection and data structure design. ...
Preprint
Full-text available
Online products generate vast amounts of user feedback data, which has become crucial for companies to improve product quality and customer satisfaction. This paper proposes the FPQA-UFD (framework to analyze product quality based on user feedback data) using data mining algorithms, natural language processing, multi-classification methods, and statistical analysis, providing detailed data support for product development teams' decision-making. The framework effectively extracts information from user feedback, accurately dividing 305,311 user feedback data into 44 effective topics and extracting explanatory keywords. A multi-classification experiment achieved a classification accuracy and recall rate of 83%. This study offers valuable insights for businesses and academia to enhance decision-making and software development through user feedback analysis.
... Unlike the laborious and time-consuming manual classification, machine learning classification algorithms show great potential for automatic classification [28,33]. Machine learning and deep learning models have many applications for reliable automated systems in different domains [29,31,32]. ...
... Adding one prevents division by zero. A word that occurs more often receives a low weight, while a word that occurs less often is assigned a higher weight by IDF [28]. TF-IDF is calculated using ...
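
The excerpt above cuts off before giving the formula, so the exact variant used by the citing paper is not known. The following is a minimal sketch that assumes one common smoothed form, idf(t) = ln((1 + N) / (1 + df(t))) + 1, where the added ones avoid division by zero; the function name `tf_idf` and the sample reviews are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothed IDF: the added ones keep the denominator non-zero.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Term frequency is the relative count of the term in the document.
        weights.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return weights


# Hypothetical app reviews, already lower-cased and tokenized.
docs = [
    "great app easy to use".split(),
    "app crashes on login".split(),
    "great support great app".split(),
]
scores = tf_idf(docs)
```

In the second review, "crashes" appears in only one document and therefore receives a higher weight than "app", which appears in all three.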
Article
Full-text available
Online media reshaped the news industry leading to information richness, timely dissemination, and immense diversity. In addition, recent technological advancements enable on-spot, prompt and frequent reporting which can be viewed on smartphones, personal computers, and mobile devices. These recent developments enhanced the importance of news categorization. Accurate news categorization has become an important element to increase user satisfaction by providing the news of their interest and desired category. Despite the available approaches for news categorization, such approaches lack the desired accuracy and require further research to improve their performance. For this purpose, this research proposes a hybrid model that comprises random forest (RF) and SoftMax regression. To further increase the accuracy, special emphasis is placed on preprocessing steps to remove the noise from the textual data. Moreover, term frequency-inverse document frequency (TF-IDF) and bag of words (BoW) approaches are leveraged for the proposed model due to their reported efficacy for the task at hand. Experimental results indicate that the proposed model achieves 98.1% accuracy and outperforms individual machine learning classifiers regarding the accuracy, precision, recall, and F1 score. Hybrid approaches of RF and SMR tend to show better results than individual, as well as, state-of-the-art approaches.
... For example, studies [23][24][25] classify app reviews by using machine learning and deep learning models. Another piece of research [26] looked at the Shopify app reviews and classified them as pleased or dissatisfied. For sentiment classification, many feature extraction approaches are used in conjunction with supervised machine learning algorithms. ...
... Table 13 shows the results of state-of-the-art studies. The study [26] used machine learning models for a sentiment analysis and LR performed well with 83% accuracy. Khalid et al. [27] performed an analysis on Twitter data using an ensemble of machine learning models and achieved 93% accuracy with the BBSVM model. ...
Article
Full-text available
Chatbots are AI-powered programs designed to replicate human conversation. They are capable of performing a wide range of tasks, including answering questions, offering directions, controlling smart home thermostats, and playing music, among other functions. ChatGPT is a popular AI-based chatbot that generates meaningful responses to queries, aiding people in learning. While some individuals support ChatGPT, others view it as a disruptive tool in the field of education. Discussions about this tool can be found across different social media platforms. Analyzing the sentiment of such social media data, which comprises people’s opinions, is crucial for assessing public sentiment regarding the success and shortcomings of such tools. This study performs a sentiment analysis and topic modeling on ChatGPT-based tweets. ChatGPT-based tweets are the author’s extracted tweets from Twitter using ChatGPT hashtags, where users share their reviews and opinions about ChatGPT, providing a reference to the thoughts expressed by users in their tweets. The Latent Dirichlet Allocation (LDA) approach is employed to identify the most frequently discussed topics in relation to ChatGPT tweets. For the sentiment analysis, a deep transformer-based Bidirectional Encoder Representations from Transformers (BERT) model with three dense layers of neural networks is proposed. Additionally, machine and deep learning models with fine-tuned parameters are utilized for a comparative analysis. Experimental results demonstrate the superior performance of the proposed BERT model, achieving an accuracy of 96.49%.
... The TF-IDF score of a word in a document is obtained by multiplying these two figures, TF and IDF. The more significant a term is in a given document, the higher the score it receives [70][71][72]. This method applies to bigrams, trigrams, and other structures besides unigrams. ...
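
To illustrate the remark about bigrams and trigrams, the sketch below builds contiguous n-grams from a token list; each n-gram can then be counted and weighted exactly like a single word. The helper name `ngrams` and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical tokenized review.
tokens = "the app keeps crashing on login".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Raw term counts over unigrams and bigrams together; these counts are the
# TF part of the score, and IDF is computed over the same n-gram vocabulary.
term_counts = Counter(tokens + bigrams)
```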
Article
Background: Text mining derives information and patterns from textual data. Online social media platforms, which have recently acquired great interest, generate vast text data about human behaviors based on their interactions. This data is generally ambiguous and unstructured. The data includes typing errors and errors in grammar that cause lexical, syntactic, and semantic uncertainties. This results in incorrect pattern detection and analysis. Researchers are employing various text mining techniques that can aid in Topic Modeling, the detection of Trending Topics, the identification of Hate Speeches, and the growth of communities in online social media networks.
Objective: This review paper compares the performance of ten machine learning classification techniques on a Twitter data set for analyzing users' sentiments on posts related to airline usage.
Methods: Review and comparative analysis of Gaussian Naive Bayes, Random Forest, Multinomial Naive Bayes, Multinomial Naive Bayes with Bagging, Adaptive Boosting (AdaBoost), Optimized AdaBoost, Support Vector Machine (SVM), Optimized SVM, Logistic Regression, and Long-Short Term Memory (LSTM) for sentiment analysis.
Results: The results of the experimental study showed that the Optimized SVM performed better than the other classifiers, with a training accuracy of 99.73% and testing accuracy of 89.74%.
Conclusion: Optimized SVM uses the RBF kernel function and nonlinear hyperplanes to split the dataset into classes, correctly classifying the dataset into distinct polarity. This, together with Feature Engineering utilizing Forward Trigrams and Weighted TF-IDF, has improved Optimized SVM classifier performance regarding train and test accuracy. Therefore, the train and test accuracy of Optimized SVM are 99.73% and 89.74%, respectively. Compared with Random Forest, this is a marginal improvement of 0.09% in train accuracy and 1.73% in test accuracy; compared with LSTM, the improvement is 1.29% (train accuracy) and 3.63% (test accuracy). Likewise, Optimized SVM gave more than 10% higher train accuracy than Gaussian Naïve Bayes, Multinomial Naïve Bayes, Multinomial Naïve Bayes with Bagging, and Logistic Regression, and a similar improvement was observed over AdaBoost and Optimized AdaBoost, which are ensemble models. Optimized SVM also outperformed all the other classification models in terms of AUC-ROC train and test scores.