Article

Abstract

Microblogging services like Twitter and Facebook collect millions of user-generated posts every moment about trending news, ongoing events, and so on. Nevertheless, finding information of interest among this huge amount of often noisy and redundant posts is extremely difficult. In general, social media analytics services have attracted increasing attention from both research and industry. Specifically, the dynamic context of microblogging requires managing not only the meaning of information but also the evolution of knowledge over the timeline. This work defines the Time Aware Knowledge Extraction (TAKE) methodology, which relies on a temporal extension of Fuzzy Formal Concept Analysis. In particular, a microblog summarization algorithm has been defined that filters the concepts organized by TAKE into a time-dependent hierarchy. The algorithm addresses topic-based summarization on Twitter. Besides accounting for the timing of concepts, another distinguishing feature of the proposed microblog summarization framework is the possibility of producing a more or less detailed summary, according to the user's needs, with good levels of quality and completeness, as highlighted in the experimental results.
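To make the core idea concrete, the following is a minimal sketch of threshold-based fuzzy Formal Concept Analysis applied to a single time window of tweets, in the spirit of TAKE's time-dependent concept hierarchy. The fuzzy context, the terms and the threshold are illustrative assumptions, not the authors' actual data, and the full TAKE pipeline (semantic annotation, temporal extension, summary filtering) is not reproduced here.

```python
# A minimal sketch of one-sided, threshold-based fuzzy Formal Concept Analysis
# applied per time window, loosely inspired by TAKE's idea of organizing tweet
# terms into a time-dependent concept hierarchy. Data and threshold are
# illustrative assumptions.
from itertools import combinations

THETA = 0.5  # assumed membership threshold for the crisp-side derivation

# fuzzy context for one time window: tweet id -> {term: membership degree}
window = {
    "t1": {"earthquake": 0.9, "rescue": 0.7, "donation": 0.1},
    "t2": {"earthquake": 0.8, "rescue": 0.2, "donation": 0.6},
    "t3": {"earthquake": 0.7, "rescue": 0.9, "donation": 0.0},
}
terms = sorted({m for degrees in window.values() for m in degrees})

def extent(term_set):
    """Tweets whose membership in every term of term_set reaches THETA."""
    return frozenset(g for g, deg in window.items()
                     if all(deg.get(m, 0.0) >= THETA for m in term_set))

def intent(tweet_set):
    """Terms shared (above THETA) by every tweet in tweet_set."""
    return frozenset(m for m in terms
                     if all(window[g].get(m, 0.0) >= THETA for g in tweet_set))

# Enumerate the concepts of this window by closing every term subset.
concepts = set()
for k in range(len(terms) + 1):
    for subset in combinations(terms, k):
        ext = extent(subset)
        concepts.add((ext, intent(ext)))

for ext, inten in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(ext), "->", sorted(inten))
```

Running the same derivation on each time window yields one small concept hierarchy per window, which is the kind of time-dependent organization that a summarization step can then filter.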


... This has led to extensive research in event analysis in microblogs which encompasses topic mining and event summarization. Existing works on summarizing an event based on the textual content of microblogs mostly provide a general factual summary [3,12,14,19,28] without considering users' emotional reactions. Yet, having a good understanding of the emotional responses of users is useful for policy makers. ...
... Event summarization in microblogs has mostly focused on generating a textual summary of the event [3,4,10,12,17,32] and ignored user reactions. Traditional summarization approaches utilize either k-means clustering [12,32] or detect volume peaks in microblogs [3,4,10] to identify subevents and pick the most representative microblogs to summarize the subevents. ...
Conference Paper
Full-text available
Microblogs have become the preferred means of communication for people to share information and feelings, especially for fast evolving events. Understanding the emotional reactions of people allows decision makers to formulate policies that are likely to be more well-received by the public and hence better accepted especially during policy implementation. However, uncovering the topics and emotions related to an event over time is a challenge due to the short and noisy nature of microblogs. This work proposes a weakly supervised learning approach to learn coherent topics and the corresponding emotional reactions as an event unfolds. We summarize the event by giving the representative microblogs and the emotion distributions associated with the topics over time. Experiments on multiple real-world event datasets demonstrate the effectiveness of the proposed approach over existing solutions.
... In this thesis, we develop a new approach to identify the topics and the corresponding emotional reactions of people as an event unfolds. Specifically, we design an event analysis framework called MOSAIC that comprises three stages: (1) a trend interval detection algorithm to determine the granularity over the event timeline for discovering the hot topics while minimizing potential information loss, (...) in microblogs, which encompasses topic mining [1,2,3] and event summarization [4,5,6,7,8]. ...
... Existing works on summarizing an event based on the textual content of microblogs mostly provide a general factual summary [4,9,8,10,11] without considering users' emotional reactions. Yet, having an accurate understanding of the emotional responses of users as an event unfolds is critical to decision makers in adopting and implementing intervention strategies in a timely and appropriate manner. ...
... Several works [6,7,8] have assumed that peaks in the posting rate of microblogs are a good indicator of the emergence of subevents. A popular peak area detection algorithm called the Offline Peak Detection (OPAD) algorithm [23] is generally used to identify peak areas in the microblog posting rate. ...
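As a rough illustration of the volume-peak idea mentioned above, the sketch below flags bins whose posting rate exceeds a running mean by a few standard deviations. This is a generic burst detector under assumed parameters and data, not the OPAD algorithm cited in the snippet.

```python
# A minimal sketch of peak (burst) detection on a microblog posting-rate
# series: flag bins whose volume exceeds a trailing mean by n_sigma standard
# deviations. Series and parameters are illustrative assumptions.
from statistics import mean, stdev

def detect_peaks(counts, window=5, n_sigma=2.0):
    """Return indices of bins whose count is a burst w.r.t. the trailing window."""
    peaks = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if counts[i] > mu + n_sigma * max(sigma, 1e-9):  # guard against flat history
            peaks.append(i)
    return peaks

# Tweets per minute for a hypothetical event timeline.
rate = [12, 14, 11, 13, 12, 13, 80, 95, 20, 14, 13, 60, 12]
print(detect_peaks(rate))  # -> [6, 7]: the burst around the 80/95 bins
```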
Thesis
Full-text available
One aspect of crisis management involves the ability to understand the emotional reactions of people to adjust the response strategies. Uncovering the topics and emotions related to an event over time is critical for timely intervention. For fast-evolving events, microblogs tend to be the preferred means of communication for people to share information and feelings. In this thesis, we develop an event analysis framework to identify the topics and emotional reactions of people in microblogs. The framework has three components: (1) a trend interval detection algorithm to determine the granularity for discovering hot topics while minimizing potential information loss, (2) a weakly supervised learning approach to learn coherent topics and their corresponding emotional reactions, (3) an event summary generator that gives representative microblogs and emotion distributions associated with the topics over time. Extensive experiments on multiple real-world event datasets demonstrate the effectiveness of the proposed approach over existing solutions.
... The same study also evaluated the relationship between topics using seven dissimilarity measures and found that the Kullback-Leibler and Euclidean distances performed better in identifying related topics useful for a user-based interactive approach. Similarly, extant research applying the Time Aware Knowledge Extraction (TAKE) methodology [22] demonstrated methods to discover valuable information from the huge amounts of content posted on Facebook and Twitter. The study used topic-based summarization of Twitter data to explore content of research interest. ...
... where the gradient in (24) represents the difference between ŷ and y multiplied by the corresponding input x_j. Please note that in (22), we need to do the partial derivatives for all the values of x_j where 1 ≤ j ≤ n. ...
... We observed on the test data with 70 items that, akin to the Naïve Bayes classification accuracy, shorter Tweets were classified using logistic regression with a greater degree of accuracy of just above 74%, and the classification accuracy decreased to 52% with longer Tweets. We calculated the Sensitivity of the classification test, which is given by the ratio of the number of correct positive predictions (22) in the output, to the total number of positives (35), to be 0.63 for the short Tweets, and 0.46 for the longer Tweets. We calculated the Specificity of the classification test, which is given by the ratio of the number of correct negative predictions (30) in the output, to the total number of negatives (35), to be 0.86 for the short Tweets, and 0.60 for the longer Tweets classification. ...
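For reference, the sensitivity and specificity figures quoted above follow directly from the stated counts. A minimal sketch of that arithmetic for the short-Tweet case (the counts come from the cited passage; the helper names are ours):

```python
# Reproduce the quoted sensitivity/specificity arithmetic: 22 correct
# positives out of 35 positives, 30 correct negatives out of 35 negatives.
def sensitivity(tp, positives):
    return tp / positives

def specificity(tn, negatives):
    return tn / negatives

print(round(sensitivity(22, 35), 2))  # 0.63
print(round(specificity(30, 35), 2))  # 0.86
```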
Article
Full-text available
Along with the Coronavirus pandemic, another crisis has manifested itself in the form of mass fear and panic phenomena, fueled by incomplete and often inaccurate information. There is therefore a tremendous need to address and better understand COVID-19's informational crisis and gauge public sentiment, so that appropriate messaging and policy decisions can be implemented. In this research article, we identify public sentiment associated with the pandemic using Coronavirus specific Tweets and R statistical software, along with its sentiment analysis packages. We demonstrate insights into the progress of fear-sentiment over time as COVID-19 approached peak levels in the United States, using descriptive textual analytics supported by necessary textual data visualizations. Furthermore, we provide a methodological overview of two essential machine learning (ML) classification methods, in the context of textual analytics, and compare their effectiveness in classifying Coronavirus Tweets of varying lengths. We observe a strong classification accuracy of 91% for short Tweets, with the Naïve Bayes method. We also observe that the logistic regression classification method provides a reasonable accuracy of 74% with shorter Tweets, and both methods showed relatively weaker performance for longer Tweets. This research provides insights into Coronavirus fear sentiment progression, and outlines associated methods, implications, limitations and opportunities.
... An example of a method that uses a fuzzy-based approach is fuzzy logic with Zadeh's classic calculus of linguistically quantified propositions (Kacprzyk et al., 2008), which addresses trend extraction and real-time problems; its results are superior under t-norm evaluation but weak on semantic problems, because the semantic results of other t-norms are unclear and can hardly be understood. Fuzzy Formal Concept Analysis (Fuzzy FCA) (Maio et al., 2015) addresses semantic and real-time problems, and its results excel in F-measure evaluations with optimal recall and comparable precision. An example of a method that uses a machine learning approach is Incremental Short Text Summarization (IncreSTS) by (C. ...), which has better outlier handling, high efficiency, and scalability on target problems. ...
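As a point of reference for the first approach mentioned in the snippet, the sketch below evaluates a linguistically quantified proposition in Zadeh's style, using a common piecewise-linear definition of the relative quantifier "most". The quantifier shape and the degrees are textbook-style assumptions, not the cited paper's setup.

```python
# A minimal sketch of Zadeh's calculus of linguistically quantified
# propositions: the truth of "most tweets show an upward trend" is the
# quantifier membership evaluated on the mean trend degree.
def most(r):
    """A standard piecewise-linear membership for the relative quantifier 'most'."""
    if r >= 0.8:
        return 1.0
    if r <= 0.3:
        return 0.0
    return 2.0 * r - 0.6

def truth_of_proposition(degrees, quantifier=most):
    """Truth of 'Q of the items satisfy the property', Zadeh-style."""
    r = sum(degrees) / len(degrees)
    return quantifier(r)

trend_degrees = [0.9, 0.7, 0.8, 0.4, 0.6]   # per-tweet degree of "upward trend"
print(truth_of_proposition(trend_degrees))   # ~0.76 for this toy data
```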
... , (S et al., 2017), (Guo et al., 2019), (Dilawari and Khan, 2019); Unsupervised learning: (Song et al., 2011), (Yousefi-azar and Hamey, 2017), (Tayal et al., 2016), (Alami et al., 2019), (Y. ...), (Sun and Zhuge, 2018), (Zhou et al., 2016); Single document: (Li et al., 2016), (Patel et al., 2019), (Sharifi et al., 2013), (Goyal et al., 2013), (Cagliero et al., 2019); Multi-document: (Malallah and Ali, 2017), (Patel et al., 2019), (Lee et al., 2013), (Padmapriya and Duraiswamy, 2014), (Fuad et al., 2019), (Khan et al., 2015a), (Khan et al., 2015b), (S et al., 2017), (Sanchez-gomez et al., 2018), (Qiang et al., 2016), (Ansamma et al., 2017), (Widjanarko et al., 2018), (Azhari et al., 2018), (Sharifi et al., 2013), (Alzuhair and Al-dhelaan, 2019), (Liu et al., 2012), (Bian et al., 2013), (Yulianti et al., 2017), (Qiang et al., 2019), (Ketui et al., 2015), (Yan and Wan, 2015), (Baralis et al., 2015); Optimization: (Song et al., 2011), (Abbasi-ghalehtaki et al., 2016), (Binwahlan et al., 2009a), (Khosravi et al., 2008), (Sanchez-gomez et al., 2018); Real-time: (Maio et al., 2015), (Rodríguez-Vidal et al., 2019), (Chua and Asur, 2009), (Kacprzyk et al., 2008), (Fu et al., 2015), (Chellal et al., 2016). Preprocessing is the initial step for preparing data: unstructured data is converted into structured data according to the needs of summarization. ...
Article
Full-text available
Text summarization automatically produces a summary containing important sentences and including all relevant important information from the original document. The main approaches, when viewed from the summary results, are extractive and abstractive. Extractive summarization is approaching maturity, and research has now shifted towards abstractive and real-time summarization. Although many achievements in datasets, methods, and techniques have been published, few papers provide a broad picture of the current state of research in this field. This paper provides a broad and systematic review of research in the field of text summarization published from 2008 to 2019. There are 85 journal and conference publications, extracted as selected studies, that were identified and analyzed to describe research topics/trends, datasets, preprocessing, features, techniques, methods, evaluations, and problems in this field of research. The results of the analysis provide an in-depth explanation of the topics/trends that are the focus of research in the field of text summarization; provide references to the public datasets, preprocessing steps, and features that have been used; and describe the techniques and methods that are often used by researchers as comparisons and as means for developing methods. At the end of this paper, several recommendations for opportunities and challenges related to text summarization research are mentioned.
... The same study also evaluated the relationship between topics using seven dissimilarity measures and found that the Kullback-Leibler and Euclidean distances performed better in identifying related topics useful for a user-based interactive approach. Similarly, extant research applying the Time Aware Knowledge Extraction (TAKE) methodology [24] demonstrated methods to discover valuable information from the huge amounts of content posted on Facebook and Twitter. The study used topic-based summarization of Twitter data to explore content of research interest. ...
... where the gradient in (24) represents the difference between ŷ and y multiplied by the corresponding input x_j. Note that in (22), we need to do the partial derivatives for all the values of x_j where 1 ≤ j ≤ n. ...
Preprint
Full-text available
Along with the Coronavirus pandemic, another crisis has manifested itself in the form of mass fear and panic phenomena, fuelled by incomplete and often inaccurate information. There is therefore a tremendous need to address and better understand COVID-19's informational crisis and gauge public sentiment, so that appropriate messaging and policy decisions can be implemented. In this research article, we identify public sentiment associated with the pandemic using Coronavirus specific Tweets and R statistical software, along with its sentiment analysis packages. We demonstrate insights into the progress of fear-sentiment over time as COVID-19 approached peak levels in the United States, using descriptive textual analytics supported by necessary textual data visualizations. Furthermore, we provide a methodological overview of two essential machine learning classification methods, in the context of textual analytics, and compare their effectiveness in classifying Coronavirus Tweets of varying lengths. We observe a strong classification accuracy of 91% for short Tweets, with the Naïve Bayes method. We also observe that the logistic regression classification method provides a reasonable accuracy of 74% with shorter Tweets, and both methods showed relatively weaker performance for longer Tweets. This research provides insights into Coronavirus fear sentiment progression, and outlines associated methods, implications, limitations and opportunities.
... Rodriguez [15] implemented a knowledge extraction system, oriented to unstructured text, to generate semantic knowledge based on entity and relationship extraction. Maio [16] defined a time-aware method that can generate topic-based digests from discrete short texts on Twitter. Morgan [17] proposed a topic classification method based on maximum entropy for tracking and detecting topic letters in microblogs. ...
... This algorithm first calculates the distribution of word W among different categories (NCD), which is defined using the information entropy shown in Eq. (16). ...
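Since Eq. (16) itself is not reproduced in the snippet, the following sketch uses plain Shannon entropy over a word's normalized category counts as an assumed stand-in for the NCD measure; the word and the counts are illustrative.

```python
# Score how a word distributes across categories with information entropy,
# in the spirit of the NCD measure mentioned above (assumed stand-in, not
# the cited paper's exact Eq. (16)).
import math

def category_entropy(counts):
    """Shannon entropy (in bits) of a word's occurrence counts per category."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Occurrences of the word "vaccine" in four hypothetical topic categories.
print(category_entropy([40, 5, 3, 2]))     # low entropy: concentrated in one category
print(category_entropy([12, 13, 12, 13]))  # high entropy: spread evenly (~2 bits)
```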
Article
Full-text available
“Pattern” can always help a machine to recognize new encounters, and so does the “requirement pattern.” Requirement patterns are essential for a cognitive service to understand a customer’s intention. Since crowdsourcing service platforms hold abundant user demands in the form of text, the method proposed in this paper aims at eliciting valuable patterns from this “treasure.” The method is based on a knowledge graph, which is constructed from the refined concepts of texts from several different domains. Due to the irregularity and variability of user demand expressions, this paper first explains the knowledge extraction method for heterogeneous text and the knowledge fusion-based knowledge graph construction method. Afterward, we introduce the requirement pattern elicitation method based on this knowledge graph. The pattern can be either a frequent demand sequence or a domain-oriented rule or link. Finally, this paper demonstrates a case study to show how those patterns can help to understand customers’ intentions effectively and accurately.
... Given the considerable number of publications on SSMED and the restrictive filters (such as journals, time of publication, etc.) adopted by the authors of the extant literature reviews (as reported in Paragraph 2.3), which eventually changed or even reversed some findings, the authors implemented an alternative, innovative, extensive and quantitative review process, called Semiautomatic Literature Review (SALR), conducted using integrated techniques for knowledge extraction. Specifically, the work adopted a methodology known as 'Time Aware Knowledge Extraction' (TAKE), introduced by De Maio, Fenza, Loia, and Parente (2016) to analyse unstructured data in the Information Technology field. Moreover, qualitative methods were also adopted to fully answer the two RQs. ...
... This work is based on an innovative technique, TAKE, originally proposed by De Maio et al. (2016) for purposes different from those pursued in this SALR but related to the analysis of temporal and conceptual data with unstructured content. TAKE consists of 3 steps, as shown in Figure 2. The Narrative Literature Review (NLR) offers an overview of a given topic and generally addresses different aspects, providing answers to broad and generic questions that investigate the entire context of a certain topic and aim to generate a basic understanding (Green, Johnson, & Adams, 2006). ...
Article
Full-text available
SSMED (Service Science, Management Engineering and Design) is multidisciplinary by nature. However, some authors stated that SSMED publications remain focused on single scientific domains. This paper proposes a Semiautomatic Literature Review (SALR) using integrated techniques for knowledge extraction –‘Time Aware Knowledge Extraction’ (TAKE) – to analyse the interdisciplinarity of SSMED publications and the potential for transdisciplinarity based on the actual adoption of Service-Dominant Logic as the foundation of SSMED research. Findings reveal that: 1) most SSMED publications are not interdisciplinary and are mainly related to Management; 2) Service-Dominant Logic has been adopted very often in SSMED publications, paving the way for SSMED transdisciplinarity. This paper offers theoretical and practical insights by enhancing the knowledge about SSMED literature and enriching the state of the art related to techniques to perform literature reviews. Furthermore, it stimulates the expansion of scholars’ and managers’ views of holistic approaches to service systems while fostering SSMED viability.
... The extraction process is based on extended temporal concept analysis and description logic, to reason over semantically represented tweet streams. A microblog summarization algorithm has been defined in [18], filtering the concepts organized by a time-aware knowledge extraction method into a time-dependent hierarchy. ...
Article
Full-text available
In settings wherein discussion topics are not statically assigned, such as in microblogs, a need exists for identifying and separating topics of a given event. We approach the problem by using a novel type of similarity, calculated between the major terms used in posts. The occurrences of such terms are periodically sampled from the posts stream. The generated temporal series are processed by using marker-based stigmergy, i.e., a biologically-inspired mechanism performing scalar and temporal information aggregation. More precisely, each sample of the series generates a functional structure, called a mark, associated with some concentration. The concentrations disperse in a scalar space and evaporate over time. Multiple deposits, when samples are close in terms of instants of time and values, aggregate into a trail and then persist longer than an isolated mark. To measure similarity between time series, the Jaccard similarity coefficient between trails is calculated. Discussion topics are generated by such a similarity measure in a clustering process using Self-Organizing Maps, and are represented via a colored term cloud. Structural parameters are correctly tuned via an adaptation mechanism based on Differential Evolution. Experiments are completed for a real-world scenario, and the resulting similarity is compared with Dynamic Time Warping (DTW) similarity.
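A heavily simplified sketch of the trail-based similarity described above is given below: each term's sampled series is mapped to a set of occupied cells in a discretized (time, value) space, standing in for the aggregated marks, and similarity is the Jaccard coefficient between the cell sets. Evaporation, SOM clustering and the Differential Evolution tuning of the paper are omitted; the bin sizes and series are assumptions.

```python
# A drastically simplified stand-in for the mark/trail aggregation of the
# paper: a "trail" is the set of discretized (time, value) cells a series
# touches, and similarity is the Jaccard coefficient between trails.
def trail(series, value_bin=5, spread=1):
    """Cells touched by the series, with a small vertical spread per sample."""
    cells = set()
    for t, v in enumerate(series):
        level = round(v / value_bin)
        for d in range(-spread, spread + 1):  # crude stand-in for mark dispersion
            cells.add((t, level + d))
    return cells

def jaccard(a, b):
    return len(a & b) / len(a | b)

term_a = [10, 12, 30, 55, 50, 20]   # sampled occurrences of term A
term_b = [11, 14, 28, 60, 48, 18]   # sampled occurrences of term B
term_c = [50, 45, 10, 5, 8, 40]     # a term with a different temporal profile

print(jaccard(trail(term_a), trail(term_b)))  # high: similar temporal behaviour
print(jaccard(trail(term_a), trail(term_c)))  # low: different behaviour
```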
... In recent years, in politics, emergency response, and business, it has become important to account for influential social network nodes in order to influence and reach as many people as possible [6]. As such, several studies have aimed to identify the influentials within a society or a community for commercial, political or economic purposes [7][8][9][10][11][12]. Bennett [13] investigated the ability to influence and then take common actions, while [14] studied the logic of the connections and relationships with human beings. ...
... Information propagation in Twitter is dependent on several factors, such as the number of followers, network topology, user influence, time of day, and events such as elections or natural disasters, among numerous others [1,10,11,23-25]. The influence of a user will determine how rapidly and far a tweet will reach other users, but also how long that tweet will survive on the net. ...
Article
Over the past several years, social networks have become a major channel for information delivery. At present, social networks are being used to obtain more followers and exert influence over people during political campaigns. However, the propagation of a social network post is dependent on numerous factors. Some of these are known; for example, the post contents, the time when it was posted, and the person or entity by whom it was posted. However, other factors remain unknown, such as what makes a post more successful than others, and how posts from similar profiles evolve and propagate differently over time. The main subject of this work is addressing these types of questions. Our approach relies on a three-fold methodology for studying the influence and propagation of posts: graph-based, semantic, and contrast pattern recognition analysis. The results obtained are complemented by a dynamic visualization that encompasses all of the variables involved. In order to corroborate our results, we collected all posts from the Twitter accounts of the most prominent Mexican political figures and analyzed the influence and propagation of each post issued.
... (12) Ease of use: the degree to which the use of the method by individuals is free of effort. (13) Consistency: the degree of uniformity, standardization, and freedom from contradiction among the elements of the structure of the method. (14) Utility: measures the value of achieving the method's goal, i.e., the difference between the worth of achieving this goal and the price paid for achieving it. ...
Article
Full-text available
The massive adoption of the Internet of Things in many industrial areas, in addition to the requirements of modern services, is posing huge challenges to the field of data mining. Moreover, the semantic interoperability of systems and enterprises requires operating across many different formats, such as ontologies, knowledge graphs, or relational databases, as well as different contexts, such as static, dynamic, or real time. Consequently, supporting this semantic interoperability requires a wide range of knowledge discovery methods with different capabilities that respond to the context of distributed architectures (DA). However, to the best of our knowledge, there is no recent general review of the state of the art in Concept Analysis (CA) and multi-relational data mining (MRDM) methods regarding knowledge discovery in DA considering semantic interoperability. In this work, a systematic literature review on CA and MRDM is conducted, providing a discussion of the characteristics they have according to the papers reviewed, supported by a clustering technique based on association rules. Moreover, the review allowed the identification of three research gaps toward a more scalable set of methods in the context of DA and heterogeneous sources.
... Moreover, the results improved by 8% over the current solution for classifying running speed conditions using a single wearable sensor in the context of the elderly in smart homes [80]. Wearable sensors and a feature extraction algorithm are used for accurate monitoring of speed conditions [74]. Moreover, MATLAB software and wearable sensors are used for monitoring the healthcare parameters of the elderly body [48]. ...
Article
Full-text available
The growing elderly population in smart home environments necessitates increased remote medical support and frequent doctor visits. To address this need, wearable sensor technology plays a crucial role in designing effective healthcare systems for the elderly, facilitating human–machine interaction. However, wearable technology has not yet been applied accurately to monitoring the various vital healthcare parameters of elders. In addition, healthcare providers encounter issues regarding the acceptability of healthcare parameter monitoring and secure data communication within the context of elderly care in smart home environments. Therefore, this research is dedicated to investigating the accuracy of wearable sensors in monitoring healthcare parameters and ensuring secure data transmission. An architectural framework is introduced, outlining the critical components of a comprehensive system, including Sensing, Data storage, and Data communication (SDD) for the monitoring process. These vital components highlight the system's functionality and introduce elements for monitoring and tracking various healthcare parameters through wearable sensors. The collected data is subsequently communicated to healthcare providers to enhance the well-being of elderly individuals. The SDD taxonomy guides the implementation of wearable sensor technology through environmental and body sensors. The proposed system demonstrates the accuracy enhancement of healthcare parameter monitoring and tracking through smart sensors. This study evaluates state-of-the-art articles on monitoring and tracking healthcare parameters through wearable sensors. In conclusion, this study underscores the importance of delineating the SDD taxonomy by classifying the system's major components, contributing to the analysis and resolution of existing challenges. It emphasizes the efficiency of remote monitoring techniques in enhancing healthcare services for the elderly in smart home environments.
... Initially, single-document summarisation was introduced. In this method, all the detailed data present in a single document were summarised and converted into a short data set (De Maio et al., 2016). Later on, multi-document summarisation methods gained popularity. ...
Article
Full-text available
This study compared the salient features of the three basic types of automatic text summarisation methods (ATSMs)—extractive, abstractive, and real-time—along with the available approaches used for each type. The data set comprised 12 reports on current issues in automatic text summarisation methods and techniques across languages, with a special focus on Arabic, whose structure has been largely claimed to be problematic in most ATSMs. Three main summarizers were compared: TAAM, OTExtSum, and OntoRealSumm. Further to this, a human-produced version of the summary of the data set was prepared and then compared to the automatically generated summary. A 10-item questionnaire was built to help with the assessment of the target ATSMs. Also, ROUGE analysis was performed to assess the efficacy of all techniques in minimising the redundancy of the data set. Findings showed that the precision of the target summarizers differed considerably, as 80% of the data set has been proven to be aware of the problems underlying ATSMs. The remaining parameters were in the normal range (65–75%). In light of the equation-based assessment of ATSMs, the highest range was noted with the removal of stop words, and the lowest range was noted with POS tagging, stem weight, and stem collection. Regarding Arabic, statistical analysis has been proven to be the most effective summarisation method (accuracy = 57.59%; recall = 58.79%; F-value = 57.99%). Further research is required to explore how the lexicogrammatical nature of languages and generic text structure would affect the text summarisation process.
... Machine learning and deep learning classifiers play an important role in classifying this information; for short texts, logistic regression and Naive Bayes classifiers provide accuracy of up to 74 and 91 percent, respectively, but for long texts the performance of these models is considerably weaker [16]. Nowadays, researchers have adopted the Time-Aware Knowledge Extraction (TAKE) methodology [17] to identify useful information from the large amount of content posted on Twitter and Facebook. To predict the future, a machine learning and cloud computing-based model was developed by researchers in May 2020 [22]. ...
Article
Full-text available
Psychologists and social scientists are interested in evaluating how people express their feelings and sentiments about natural disasters, terrorism, and pandemic situations. Covid-19 has increased the number of psychological issues, such as depression, due to social changes and employment issues. The everyday life of people has been disturbed by the covid-19 pandemic. During the lockdown, people share their opinions on social sites like Twitter and Facebook. Due to this pandemic situation and lockdown, people's emotions differ and are categorized as fear, anger, joy, and sadness in relation to covid-19 and the lockdown. In this paper, we have used machine learning and Natural Language Processing approaches to design an effective machine learning model for the classification of people's emotions related to covid-19. The early detection of sentiment allows for better handling of the pandemic situation and government policies. Text is categorized into fear, joy, anger, and sadness sentiment classes. We have proposed a deep learning-based LSTM model for covid-19 related emotion identification and achieved an accuracy of 71.7% with the proposed model. For the robustness of the proposed model, we considered several machine learning classifiers and compared them with our proposed model. Data Availability: In this study, an open-source dataset is used: https://www.kaggle.com/code/poulamibakshi/covid-19-sentiment-analysis/data
... where I corresponds to the score extracted by the Semantic Annotation phase, indicating the triple fuzzy relationships between users U, topics URIs and time T [14]. Example 2. Let us assume we have a user tweet stream between t1 and t2, including 6 users U = u1, u2, . . . ...
... Tweet contents are user-generated data in the form of text, images, audio or video. Furthermore, tweet contents can include other types of data, such as hashtags and mentions of other accounts [5][6][7][8]. ...
Method
Full-text available
The main objective of this study is to propose and develop a deep learning-based model for sentiment analysis using data extracted from Twitter.
... The triadic timed FCA for users' post contents is composed of three dimensions, i.e., users, topics (linguistic terms extracted from the tweets' content in the semantic representation phase), and time (i.e., objects, attributes, and conditions): TFC = (U, URIs, T, I), in which I indicates the triple fuzzy relationships (De Maio et al. 2016), with values in [0, 1], among users U, topics URIs, and time T. ...
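A minimal sketch of how such a triadic fuzzy context can be represented is shown below: the incidence I maps (user, topic URI, time slot) triples to degrees in [0, 1], and simple threshold queries slice it by user or by topic. The data, the threshold and the helper functions are illustrative assumptions, not the cited system's implementation.

```python
# A toy triadic fuzzy context TFC = (U, URIs, T, I): the incidence I maps
# (user, topic URI, time slot) triples to a degree in [0, 1].
I = {
    ("u1", "dbpedia:Football",  "morning"): 0.9,
    ("u1", "dbpedia:Elections", "morning"): 0.2,
    ("u1", "dbpedia:Football",  "evening"): 0.4,
    ("u2", "dbpedia:Elections", "evening"): 0.8,
    ("u2", "dbpedia:Football",  "evening"): 0.6,
}

def topics_of(user, slot, theta=0.5):
    """Topic URIs the user relates to at least to degree theta in a time slot."""
    return {uri: d for (u, uri, t), d in I.items()
            if u == user and t == slot and d >= theta}

def users_of(uri, slot, theta=0.5):
    """Users related to a topic at least to degree theta in a time slot."""
    return {u: d for (u, topic, t), d in I.items()
            if topic == uri and t == slot and d >= theta}

print(topics_of("u1", "morning"))               # {'dbpedia:Football': 0.9}
print(users_of("dbpedia:Football", "evening"))  # {'u2': 0.6}
```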
Article
Full-text available
Advertising is becoming a business on social networks. Billions of people around the world use social media, and it has quickly become one of the defining technologies of our time. Social platforms like Twitter are one of the primary means of communication and information dissemination and can capture the interest of potential customers. Therefore, it is crucial to select suitable advertisements for users at specific times and locations to capture their attention profitably. In this paper, we propose a context-aware advertising recommendation system that, by analyzing users' tweets and movements along a timeline, infers the personal interests of users and provides attractive ads to users through triadic formal concept analysis theory.
... The same study also evaluated the relationship between topics using seven dissimilarity measures and found that the Kullback-Leibler and Euclidean distances performed better in identifying related topics useful for a user-based interactive approach. Similarly, extant research applying the Time Aware Knowledge Extraction (TAKE) methodology [25] demonstrated methods to discover valuable information from the huge amounts of content posted on Facebook and Twitter. The study used topic-based summarization of Twitter data to explore content of research interest. ...
Preprint
Full-text available
Along with the Coronavirus pandemic, another crisis has manifested itself in the form of mass fear and panic phenomena, fueled by incomplete and often inaccurate information. There is therefore a tremendous need to address and better understand COVID-19's informational crisis and gauge public sentiment, so that appropriate messaging and policy decisions can be implemented. In this research article, we identify public sentiment associated with the pandemic using Coronavirus specific Tweets and R statistical software, along with its sentiment analysis packages. We demonstrate insights into the progress of fear-sentiment over time as COVID-19 approached peak levels in the United States, using descriptive textual analytics supported by necessary textual data visualizations. Furthermore, we provide a methodological overview of two essential machine learning (ML) classification methods, in the context of textual analytics, and compare their effectiveness in classifying Coronavirus Tweets of varying lengths. We observe a strong classification accuracy of 91% for short Tweets, with the Naïve Bayes method. We also observe that the logistic regression classification method provides a reasonable accuracy of 74% with shorter Tweets, and both methods showed relatively weaker performance for longer Tweets. This research provides insights into Coronavirus fear sentiment progression, and outlines associated methods, implications, limitations and opportunities.
... The same study also evaluated the relationship between topics using seven dissimilarity measures and found that the Kullback-Leibler and Euclidean distances performed better in identifying related topics useful for a user-based interactive approach. Similarly, extant research applying the Time Aware Knowledge Extraction (TAKE) methodology [25] demonstrated methods to discover valuable information from the huge amounts of content posted on Facebook and Twitter. The study used topic-based summarization of Twitter data to explore content of research interest. ...
Preprint
Full-text available
Along with the Coronavirus pandemic, another crisis has manifested itself in the form of mass fear and panic phenomena, fueled by incomplete and often inaccurate information. There is therefore a tremendous need to address and better understand COVID-19's informational crisis and gauge public sentiment, so that appropriate messaging and policy decisions can be implemented. In this research article, we identify public sentiment associated with the pandemic using Coronavirus specific Tweets and R statistical software, along with its sentiment analysis packages. We demonstrate insights into the progress of fear-sentiment over time as COVID-19 approached peak levels in the United States, using descriptive textual analytics supported by necessary textual data visualizations. Furthermore, we provide a methodological overview of two essential machine learning (ML) classification methods, in the context of textual analytics, and compare their effectiveness in classifying Coronavirus Tweets of varying lengths. We observe a strong classification accuracy of 91% for short Tweets, with the Naive Bayes method. We also observe that the logistic regression classification method provides a reasonable accuracy of 74% with shorter Tweets, and both methods showed relatively weaker performance for longer Tweets. This research provides insights into Coronavirus fear sentiment progression, and outlines associated methods, implications, limitations and opportunities.
... A feasible security system model for the smart city environment was studied by Zhou and Luo (2017) using a combination of fuzzy logic and entropy weight methods. De Maio et al. (2016) proposed Time Aware Knowledge Extraction (TAKE), which applies fuzzy formal concept analysis to extract useful information from emerging data on social network platforms. Zeng et al. (2017) analysed the variations in the attribute values of fuzzy rough approximations. ...
Article
Over the last few years, Big Data has gained tremendous attention from the research community. The data being generated in huge quantities from almost every field is unstructured and unprocessed. Extracting a knowledge base and useful information from big raw data is one of the major challenges present today. Various computational intelligence and soft computing techniques have been proposed for efficient big data analytics. Fuzzy techniques are one of the soft computing approaches that can play a crucial role in current big data challenges by pre-processing and reconstructing data. There is a wide spread of application domains where traditional fuzzy sets (type-1 fuzzy sets) and higher-order fuzzy sets (type-2 fuzzy sets) have shown remarkable outcomes. Although this research domain of "fuzzy techniques in Big Data" is gaining some attention, there is a strong need to motivate researchers to explore this area further. In this paper, we have conducted a bibliometric study on recent developments in the field of "fuzzy techniques in big data". In the bibliometric study, various performance metrics, including total papers, total citations, and citations per paper, are calculated. Further, the top 10 most productive and highly cited authors, disciplines, source journals, countries, institutions, and highly influential papers are also evaluated. Later, a comparative analysis of fuzzy techniques in big data is performed after analysing the most influential works in this field.
... In Latin America, few findings are recorded (n=3). Regarding sample size, the studies show that it is possible to run experiments with small or large data samples (Han, Pei, and Kamber, 2011), ranging from 8 to 25,000 users (Bayne, 2015; Kuznetsov et al., 2015; Pill et al., 2017), as well as from 165 to more than one million tweets (De Maio et al., 2016; Kimmons et al., 2017); these findings indicate that sample sizes vary across experiments. ...
Article
Full-text available
In recent years, there has been growing interest among education stakeholders in incorporating ICT into their institutions, as is the case with social networks, which, far from being a problem and through guided use, make it possible to innovate traditional class sessions and improve communication between teachers and students. This study set two objectives: (1) to conduct a systematic literature review, searching for articles published between January 2007 and March 2019 in databases such as ACM, IEEE, ScienceDirect, and Springer, among others, to identify research that has applied data mining techniques for the extraction and analysis of Twitter data in higher education; and (2) to highlight the pedagogical practices that have incorporated Twitter and data mining to improve educational processes. Of the 315 articles retrieved, 65 that met the inclusion criteria were selected. The main results indicate that: (1) the most widely used data mining techniques are predictive, with classification tasks; (2) Twitter is used mainly to: (a) determine student perception; (b) share information, materials, and resources; (c) generate communication and participation; (d) foster skills; and (e) improve oral expression and academic performance; (3) the United States is the country with the largest number of studies; however, in Latin American countries the findings are few, which opens a field of research in this region; and (4) the studies included models, methods, strategies, theories, or instruments as pedagogical practice; thus, there is no consensus on how data extracted from Twitter could be incorporated into higher education to improve teaching and learning processes.
... Motivated by these concerns, this paper introduces a topic relation identification methodology after applying topic modeling to massive scientific literature. In the past five years, topic model-based approaches have attracted increasing interest in bibliometrics [8], [16], [17]. Wei and Croft [18] have demonstrated that topic models outperform most cluster-based approaches in information retrieval. ...
Article
Over the past five years, topic models have been applied to bibliometrics research as an efficient tool for discovering latent and potentially useful content. The combination of topic modeling algorithms and bibliometrics has generated new challenges in interpreting and understanding the outcome of topic modeling. Motivated by these new challenges, this paper proposes a systematic methodology for topic analysis in scientific literature corpora to address the concerns of conducting post-topic-modeling analysis. By linking the corpus metadata with the discovered topics, we characterize them with a number of topic-based analytic indices to explore their significance, developing trend, and received attention. A topic relation identification approach is then presented to quantitatively model the relations among the topics. To demonstrate the feasibility and effectiveness of our methodology, we present two case studies, using big data and dye-sensitized solar cell publications derived from searches in Web of Science. Possible application of the methodology in telling good stories of a target corpus is also explored to facilitate further research management and opportunity discovery.
... To model social network data, Kundu and Pal (2015) proposed a novel technique called the Fuzzy Granular Social Network (FGSN), combining granular computing and a fuzzy neighborhood approach. To tackle big data veracity, fuzzy formal concept analysis was used in De Maio et al. (2016) to handle noisy and redundant data. In Ramachandramurthy et al. (2015), the authors used fuzzy Bayesian inference to refine the dataset. ...
Article
Within big data, veracity refers to the uncertainty present in the dataset. The continuous flow of unstructured data with unwanted noise may introduce abnormalities into the dataset, making it unusable. In this paper, we propose a novel method to handle the veracity characteristic of big data using the concept of the footprint of uncertainty (FOU) in interval type-2 fuzzy sets (IT2 FSs). The proposed method helps in handling the veracity issue in big data and reduces the instances to a manageable extent. We have compared the results with existing clustering-based methods and examined the relationship between the clusters and the FOUs by comparing their centroids and defuzzified values. To scrutinize the validity of our results, we have also performed a number of additional experiments by appending extra instances to the datasets. To check its consistency and efficacy, the proposed methodology is assessed from three different aspects. Experimental results validate that the proposed method can suitably handle the veracity issue in big datasets and is efficient in reducing the instances.
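To illustrate the footprint-of-uncertainty idea referenced above, the sketch below encodes an interval type-2 fuzzy set as a pair of lower/upper memberships, measures the area between them, and defuzzifies by averaging two type-1 centroids. The membership shapes are assumptions, and the averaging is a crude stand-in for proper Karnik-Mendel type reduction, not the method of the cited paper.

```python
# A toy interval type-2 fuzzy set: lower and upper memberships bound the
# footprint of uncertainty (FOU); defuzzification here is a crude average of
# two type-1 centroids.
def tri(x, a, b, c):
    """Triangular type-1 membership."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def lower_mf(x):  # narrower, scaled triangle
    return 0.8 * tri(x, 3.0, 5.0, 7.0)

def upper_mf(x):  # wider triangle enclosing the lower one
    return tri(x, 2.0, 5.0, 8.0)

xs = [i / 10.0 for i in range(0, 101)]                         # domain [0, 10]
fou_area = sum(upper_mf(x) - lower_mf(x) for x in xs) * 0.1    # size of the FOU

def centroid(mf):
    num = sum(x * mf(x) for x in xs)
    den = sum(mf(x) for x in xs)
    return num / den

defuzzified = (centroid(lower_mf) + centroid(upper_mf)) / 2.0
print(round(fou_area, 2), round(defuzzified, 2))  # FOU size and crisp output (~5.0)
```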
... The topic modeling techniques use either dynamic topic modeling [24], hierarchical topic modeling [7] or keyword-based clustering [25]. The sub-event detection based techniques extract different sub-events from tweet streams through several features like tweet bursts [26], temporal changes in the tweet vocabulary set [27], [28], [29], [30], as well as information from external sources [31], [32]. Generative models like Hidden Markov Models that consider both burstiness and word distribution of the tweets [33] and Hierarchical Dirichlet Process based topic modeling have also been proposed [34]. ...
Article
Twitter has become an essential platform for news media sources to disseminate news. The opinions expressed through Twitter can be mined by news media sources to obtain users' reactions centered around different news articles. A comprehensive summary of the users' reactions with respect to a news article can be crucial for various reasons, such as: 1) understanding the sensitivity/importance of the news; 2) obtaining insights about the diverse opinions of the readers with respect to the news; and 3) understanding the key aspects that draw the interest of the readers. However, the selected summary tweets must fulfill multiple objectives, like relevance to the news article and diversity among the selected tweets, and they should cover the entire spectrum of opinions expressed through the tweets. Existing methods primarily attempt to identify a set of relevant tweets from which the summary tweets are selected so as to maintain the diversity and coverage requirements. However, the noise and the nontemporal behavior of the article-specific tweets make the identification of such relevant tweets extremely difficult, resulting in poor summary quality. In this paper, through empirical investigations, we show that initially identifying the diverse opinions can lead to better identification of the relevant tweets, i.e., following a specific ordering of the objectives can lead to an improved summary. We subsequently propose a tweet summarization technique that follows such a specific ordering. Validation of our proposed approach on 800 news articles with 2.1 billion related tweets shows that the proposed approach produces 11.6%-34.8% improvement in summary quality as compared to existing state-of-the-art techniques.
... Most publications rely on general-purpose techniques from traditional text summarization along with redundancy detection methods to avoid the repetition of contents in the summary (Inouye and Kalita 2011;Takamura et al. 2011). Social network specific signals (such as user connectivity and activity (Liu et al. 2012) and time-based features (Alsaedi et al. 2016;De Maio et al. 2016;He et al. 2017)) have also been widely exploited. ...
Article
Full-text available
Producing online reputation summaries for an entity (company, brand, etc.) is a focused summarization task with a distinctive feature: issues that may affect the reputation of the entity take priority in the summary. In this paper we (i) present a new test collection of manually created (abstractive and extractive) reputation reports which summarize tweet streams for 31 companies in the banking and automobile domains; (ii) propose a novel methodology to evaluate summaries in the context of online reputation monitoring, which profits from an analogy between reputation reports and the problem of diversity in search; and (iii) provide empirical evidence that producing reputation reports is different from a standard summarization problem, and incorporating priority signals is essential to address the task effectively.
... Meanwhile, there are real-life problems in which massive parallelization of computations on Apache Hadoop or Spark, and the use of scalable environments like the Cloud, brought significant improvements in the performance of data processing and analysis. The big data challenge was observed and addressed in various works devoted to intelligent transport and smart cities [11,19,42,43,74,75,84], water monitoring [12,22,90], social network analysis [13,14,77], multimedia processing [72,82], the internet of things (IoT) [9], social media monitoring [50], life sciences [3,31,32,44,58,69] and disease data analysis [6,45,81], telecommunication [27], and finance [2], to mention just a few. Many hot issues in various sub-fields of bioinformatics were also solved with the use of Big Data ecosystems and Cloud computing, e.g., mapping next-generation sequencing data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping and personal genomics [65], sequence analysis and assembly [17,30,34,35,47,62], multiple alignments of DNA and RNA sequences [86,91], codon analysis with local MapReduce aggregations [63], NGS data analysis [8], phylogeny [24,48], proteomics [37], analysis of protein-ligand binding sites [23], and others. ...
Article
Full-text available
Intrinsically disordered proteins (IDPs) constitute a significant part of the proteins that exist and act in the cells of living organisms. IDPs play key roles in central cellular processes, and some of them are closely related to various human diseases, like cancer or neurodegenerative disorders. Identification of IDPs and studying their structural characteristics have become an important part of structural bioinformatics and structural genomics. However, the growing amount of genomic and protein sequences in public repositories puts pressure on existing methods for the identification of IDPs. Large volumes of protein amino acid sequences need to be analyzed in terms of their propensity to form disordered regions, and this task requires novel tools and scalable platforms to cope with this big biological data challenge. In this paper, we show how the identification of disordered regions of 3D protein structures can be efficiently accelerated with the use of an Apache Spark cluster established and scaled on the public Cloud. For this purpose, we propose a Spark-based meta-predictor (Spark-IDPP), which enables efficient prediction of disordered regions of proteins on a large scale. Results of our performance tests show that, for large data sets, our method achieves almost linear speedup when scaling out the computations on a 32-node Spark cluster located in the Azure cloud. This proves that through appropriate partitioning of data and by increasing the degree of parallelism, we can significantly improve the efficiency of IDP predictions. Additionally, by using several basic predictors, aggregating their ranks in various consensus modes, and filtering the final outcome with a dedicated fuzzy filter, Spark-IDPP increases the quality of predictions.
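The data-parallel pattern behind such Spark-based acceleration can be sketched as follows in PySpark; the per-sequence scoring function is a toy placeholder (fraction of disorder-promoting residues) and does not reproduce the Spark-IDPP ensemble, its consensus modes, or its fuzzy filter.

```python
# A minimal PySpark sketch of the partition-and-map pattern: protein
# sequences are distributed across the cluster and a per-sequence predictor
# runs on each partition. The predictor below is a toy placeholder.
from pyspark import SparkContext

DISORDER_PRONE = set("PESQKRG")  # residues often enriched in disordered regions

def toy_disorder_score(record):
    """(id, sequence) -> (id, naive disorder propensity in [0, 1])."""
    seq_id, seq = record
    hits = sum(1 for aa in seq if aa in DISORDER_PRONE)
    return seq_id, hits / max(len(seq), 1)

if __name__ == "__main__":
    sc = SparkContext("local[*]", "idp-sketch")          # scale out by changing the master URL
    sequences = [("P1", "MKPSSEQQKRG"), ("P2", "MWLVAILFA")]
    scores = (sc.parallelize(sequences, numSlices=2)     # partition the input
                .map(toy_disorder_score)                 # run the predictor per sequence
                .collect())
    print(scores)
    sc.stop()
```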
... We observe these posts in a time interval, divided into temporal windows (say, for example, morning, afternoon and evening). By exploiting known techniques to extract concepts from these tweets [3], [4], we are able to give a semantic value to these tweets in order to obtain a measure of similarity among them. Then we formally model users, advertisements, time intervals, temporal windows and this similarity measure through rough sets to get a set of the most suitable advertisements for a given user in a particular temporal window within the time interval under observation. ...
... Meanwhile, there are real-life problems in which massive parallelization of computations on Apache Hadoop or Spark, and the use of scalable environments like the Cloud, brought significant improvements in the performance of data processing and analysis. The big data challenge was observed and addressed in various works devoted to intelligent transport and smart cities [11,19,42,43,74,75,84], water monitoring [12,22,90], social network analysis [13,14,77], multimedia processing [72,82], the internet of things (IoT) [9], social media monitoring [50], life sciences [3,31,32,44,58,69] and disease data analysis [6,45,81], telecommunication [27], and finance [2], to mention just a few. Many hot issues in various sub-fields of bioinformatics were also solved with the use of Big Data ecosystems and Cloud computing, e.g., mapping next-generation sequencing data to the human genome and other reference genomes, for use in a variety of biological analyses including SNP discovery, genotyping and personal genomics [65], sequence analysis and assembly [17,30,34,35,47,62], multiple alignments of DNA and RNA sequences [86,91], codon analysis with local MapReduce aggregations [63], NGS data analysis [8], phylogeny [24,48], proteomics [37], analysis of protein-ligand binding sites [23], and others. ...
Chapter
Intrinsically disordered proteins (IDPs) constitute a wide range of molecules that act in cells of living organisms and mediate many protein–protein interactions and many regulatory processes. Computational identification of disordered regions in protein amino acid sequences, thus, became an important branch of 3D protein structure prediction and modeling. In this chapter, we will see the IDP meta-predictor that applies an ensemble of primary predictors in order to increase the quality of IDP prediction. We will also see the highly scalable implementation of the meta-predictor on the Spark cluster (Spark-IDPP) that mitigates the problem of the exponentially growing number of protein amino acid sequences in public repositories. Spark-IDPP responds very well to the current needs of IDP prediction by parallelizing computations on the Spark cluster that can be scaled on demand on the Microsoft Azure cloud according to particular requirements for computing power.
... • social networks [6,7,8,11,13,14,31,32], • multimedia processing [29], • internet of things (IoT) [3], • intelligent transport [15,12,30,16], • medicine and bioinformatics [22,21,20,24], • finance [1], and many others [19,33]. ...
Chapter
Scientific solutions presented in this book rely on various technologies that emerged in computer science. Some of them emerged recently and are quite new in the bioinformatics field. Some of them have been widely used for many years in developing efficient and reliable IT systems supporting various forms of business, but are not frequently used in bioinformatics. This chapter provides a technological road map for the solutions presented in this book. It covers a brief introduction to the concept of cloud computing, cloud service, and deployment models. It also defines the Big Data challenge and presents the benefits of using multi-threading in scientific computations. It then explains graphics processing units (GPUs) and the CUDA architecture. Finally, it focuses on relational databases and the SQL language used for declarative querying.
... Because of its excellent performance in knowledge representation and extraction from large volumes of unstructured data [54], FCA is widely used in fields such as knowledge discovery, ontology learning [55], information retrieval and recommender systems [56] to extract useful information and to construct knowledge graphs for data organization and visualization [57]. ...
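For readers unfamiliar with FCA, the following minimal sketch enumerates the formal concepts of a small, hypothetical object-attribute context by closing every attribute subset; it illustrates the general technique only, not any specific system cited above:

from itertools import combinations

# hypothetical formal context: object -> set of attributes
context = {
    "tweet1": {"election", "results"},
    "tweet2": {"election", "debate"},
    "tweet3": {"election", "results", "debate"},
}
attributes = set().union(*context.values())

def extent(attrs):          # objects sharing all attributes in attrs
    return {o for o, a in context.items() if attrs <= a}

def intent(objs):           # attributes shared by all objects in objs
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

# every formal concept is (extent(B), intent(extent(B))) for some attribute set B
concepts = set()
for r in range(len(attributes) + 1):
    for subset in combinations(sorted(attributes), r):
        objs = extent(set(subset))
        concepts.add((frozenset(objs), frozenset(intent(objs))))

for ext, inte in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(ext), "<->", sorted(inte))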
Article
Activity recognition is one of the most important prerequisites for smart home applications. It is a challenging topic due to the high requirements for reliable data acquisition and efficient data analysis. Besides, the heterogeneous layouts of smart homes, the number of residents and varied human behavioral patterns also aggravate the complexity of recognition. Therefore, most human activity recognition systems are based on an unrealistic assumption that there is only one resident performing activities. In this paper, we investigate the issue of multi-resident activity recognition and propose a knowledge-driven solution on the basis of formal concept analysis (FCA) to identify human activities from non-intrusive sensor data. We extract the ontological correlations among sequential behavioral patterns. At the same time, these correlations are well organized in a graphical knowledge base, without intervention from domain experts. We propose an incremental lattice search strategy in order to retrieve the best inference given a few sensor events. Compared with other conventional probabilistic methods, our solution outperforms on the CASAS multi-resident benchmark dataset. Furthermore, we open up a promising solution of sequential pattern mining to discover the ontological features of temporal and sequential sensor data.
... First, our summarization process does not currently consider the time of occurrence of the tweets. Motivated by the Time-Aware Knowledge Extraction (TAKE) methodology presented in De Maio et al. (2016), we plan to extend our summarization process to incorporate the temporal evolution of tweets by identifying temporal peaks of tweet frequency through analyzing their timestamps. Second, it would be interesting to evaluate the performance of the proposed summarization process in the context of question-answering systems. ...
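A minimal sketch of the kind of temporal-peak detection mentioned in this excerpt, using hypothetical timestamps and a simple mean-plus-standard-deviation threshold rather than any published detector:

from datetime import datetime, timedelta
from statistics import mean, stdev

# hypothetical tweet timestamps for one topic
start = datetime(2024, 1, 1)
timestamps = [start + timedelta(minutes=m) for m in
              [1, 2, 3, 30, 31, 32, 33, 34, 35, 36, 90, 91, 150]]

def peak_windows(timestamps, window=timedelta(minutes=15), k=1.0):
    """Bin timestamps into fixed windows and flag windows whose count
    exceeds mean + k * std of all window counts."""
    first, last = min(timestamps), max(timestamps)
    counts = []
    t = first
    while t <= last:
        counts.append(((t, t + window),
                       sum(1 for ts in timestamps if t <= ts < t + window)))
        t += window
    values = [c for _, c in counts]
    threshold = mean(values) + k * stdev(values)
    return [(w, c) for w, c in counts if c > threshold]

for (lo, hi), c in peak_windows(timestamps):
    print(f"{lo:%H:%M}-{hi:%H:%M}: {c} tweets (peak)")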
Article
Full-text available
Recent advances in microblog content summarization have primarily viewed this task in the context of traditional multi-document summarization techniques where a microblog post or a collection of posts forms one document. While these techniques already facilitate information aggregation, categorization and visualization of microblog posts, they fall short in two aspects: i) when summarizing a certain topic from microblog content, not all existing techniques take topic polarity into account. This is an important consideration in that the summarization of a topic should cover all aspects of the topic, and hence taking polarity (sentiment) into account can lead to the inclusion of the less popular polarity in the summarization process. ii) Some summarization techniques produce summaries at the topic level. However, a given topic can have more than one important aspect that needs to be represented in the summarization process. Our work in this paper addresses these two challenges by considering both topic sentiments and topic aspects in tandem. We compare our work with state-of-the-art Twitter summarization techniques and show that our method is able to outperform existing methods on standard metrics such as ROUGE-1.
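Since the comparison above is reported in ROUGE-1, a minimal sketch of that metric (unigram precision, recall, and F1 between a candidate and a reference summary, without the stemming and stop-word options of the official toolkit):

from collections import Counter

def rouge_1(candidate, reference):
    """Unigram precision, recall and F1 between two summaries."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())      # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge_1("storm hits the coast tonight",
                  "a severe storm hits the northern coast")
print(f"ROUGE-1  P={p:.2f}  R={r:.2f}  F1={f:.2f}")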
... Therefore, automatic labelling of key phrases from forum text increases the chance of getting an answer, which is the primary intention of posting a query on the web. In the literature, we find several works on automatic summarization of general-domain text or Twitter-like microblog text [2,4,7,8], but we are unable to find any work on summarization of diagnosis questions. To develop the system, we use a hybrid strategy in which a set of statistical measures, syntactic information and semantic information are combined. ...
... The era of Big data that we entered several years ago has changed our imagination about the type and the volume of data that can be processed, as well as the value of data. This is now visible in many fields which are experiencing an explosion of data that are considered relevant, including social networks [1], [2], [3], [4], [5], [6], [7], [8], multimedia processing [9], internet of things (IoT) [10], intelligent transport [11], [12], [13], [14], medicine and bioinformatics [15], finance [16], and many others [17], [18], that face the problem of big data. The big data problem (or opportunity) usually arises when data sets are so large that the conventional database management and data analysis tools are insufficient to process them [19]. ...
Article
In recent years, many fields have experienced a sudden proliferation of data, which increases both the volume of data that must be processed and the variety of formats in which the data are stored. This puts pressure on existing compute infrastructures and data analysis methods, as more and more data are considered a useful source of information for making critical decisions in particular fields. Among these fields are several areas related to human life, e.g., various branches of medicine, where the uncertainty of data complicates the data analysis, and where the inclusion of fuzzy expert knowledge in data processing brings many advantages. In this paper, we show how fuzzy techniques can be incorporated in Big Data analytics carried out with the declarative U-SQL language over a Big Data Lake located on the Cloud. We define the concept of the Big Data Lake together with the Extract, Process, and Store (EPS) process performed while schematizing and processing data from the Data Lake, and while storing results of the processing. Our solution, developed as a Fuzzy Search Library for Data Lake, introduces the possibility of (1) massively-parallel, declarative querying of the Big Data Lake with simple and complex fuzzy search criteria, (2) using fuzzy linguistic terms in various data transformations, and (3) fuzzy grouping. The presented ideas are exemplified by a distributed analysis of large volumes of biomedical data on the Microsoft Azure cloud. Results of the performed tests confirm that the presented solution is highly scalable on the Cloud and is a successful step toward soft and declarative processing of data on a large scale. The solution presented in this paper directly addresses three characteristics of Big Data, i.e., volume, variety, and velocity, and indirectly addresses veracity and value.
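A toy illustration of fuzzy linguistic terms used as search criteria, in plain Python rather than the U-SQL library the paper describes; the records and the membership function are hypothetical:

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function for a linguistic term."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# hypothetical linguistic term "high blood pressure" (systolic, mmHg)
def high_pressure(systolic):
    return trapezoid(systolic, 125, 140, 200, 220)

records = [("patient_a", 118), ("patient_b", 136), ("patient_c", 162)]

# fuzzy selection: keep records whose membership exceeds a cut-off,
# and report the degree instead of a hard yes/no answer
for pid, sys_bp in records:
    mu = high_pressure(sys_bp)
    if mu >= 0.5:
        print(f"{pid}: membership in 'high blood pressure' = {mu:.2f}")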
Article
Full-text available
Unexpected events occur frequently, and predicting network public opinion about them is an important research direction. Current public opinion prediction models mostly focus on improving model accuracy and rarely explore the laws governing how opinion spreads. This study analyzes how emergencies propagate on microblogs, introducing the implicit relationships among emotion vectors, user browsing behavior, and the emergencies themselves. It also studies the factors that cause fluctuations in the microblog transmission of emergencies and selects the grey prediction model as a starting point. The defects of that model are analyzed: it suffers from a constant-increment problem and lacks the ability to deal with interference factors, so a metabolic grey prediction model is used for predicting microblog emergencies. The concept of an incremental coefficient is introduced and a hybrid fuzzy neural network is adopted, with emotional knowledge treated as the key factor affecting the increment of the grey prediction model. The fuzzy neural network is used to analyze how microblog emotional data are generated, yielding a hybrid public opinion prediction model that combines the fuzzy neural network with the grey prediction model. In the experiments, the performance of the optimized prediction model is compared with that of the original prediction model, and extensive data analyses show that the optimized prediction model is effective. The experimental results show that the optimized prediction model has higher prediction accuracy.
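A minimal NumPy sketch of the classical GM(1,1) grey prediction model that the study takes as its starting point, on hypothetical daily post counts; the metabolic and fuzzy-neural extensions described above are not reproduced:

import numpy as np

def gm11(x0, horizon=3):
    """Classical GM(1,1): fit on the series x0 and forecast `horizon` steps."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.cumsum(x0)                                 # accumulated series
    z1 = 0.5 * (x1[1:] + x1[:-1])                      # background values
    B = np.column_stack((-z1, np.ones(len(z1))))
    Y = x0[1:]
    a, b = np.linalg.lstsq(B, Y, rcond=None)[0]        # development / grey input coefficients
    k = np.arange(len(x0) + horizon)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a  # fitted accumulated series
    x0_hat = np.concatenate(([x1_hat[0]], np.diff(x1_hat)))
    return x0_hat[len(x0):]                            # the forecast part only

# hypothetical daily counts of posts about an unfolding emergency
history = [120, 150, 190, 240, 300, 380]
print("next 3 days:", np.round(gm11(history), 1))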
Article
The purpose of this study is to investigate the status and evolution of scientific research on the effect of social networks on big data and on the use of big data for modeling the behavior of social network users. This paper presents a comprehensive review of the studies associated with big data in social media. The study uses the Scopus database as the primary search engine and covers 2000 highly cited articles over the period 2012-2019. The records are statistically analyzed and categorized according to different criteria. The findings show that research has grown exponentially since 2014 and the trend has continued at relatively stable rates. Based on the survey, decision support systems is the keyword with the highest density, followed by heuristic methods. Among the most cited articles, papers published by researchers in the United States have received the highest number of citations (7548), followed by the United Kingdom (588) and China with 543 citations. Thematic analysis shows that the subject has remained an important and well-developed research field, and better results can be obtained by combining this research with "big data analytics" and "twitter", which are important topics in this field but not yet well developed.
Article
Microblog summarization systems are gaining importance during natural disasters. A lot of tweets are posted along with multimedia content during the occurrence of any natural disaster event. Extracting relevant information/summary from these tweets is important for the smooth functioning of the rescue operation. Moreover, because of the limited size of the tweets, in many cases, tweets are associated with images. The current work is the first of its kind where both the image and the tweet text are utilized simultaneously to generate a summary from microblog data generated during a disaster event. Different aspects, such as syntactic similarity, the maximum length of the tweets, retweet score, and antiredundancy, are considered as objective functions and those are simultaneously optimized using a metaheuristic population-based evolutionary strategy to select a good set of tweets to form a good quality summary. In order to extract information from images, a dense captioning model is utilized and the dense captions are further utilized for calculating the antiredundancy measure. We employed word mover distance to capture the semantic similarity between two tweets. Due to the unavailability of the dataset for multimodal microblog summarization tasks in a disaster-event scenario, datasets are created and made openly available to the community. The obtained summarization results are evaluated using the well-known ROUGE measure.
Article
Social media (SM) are among the most widespread and fastest-growing data-generating applications on the Internet, which has increased the study of these data. However, the efficient processing of such massive data is challenging, so we require systems that can learn from these data, such as machine learning. Machine learning methods enable systems to learn by themselves. Many papers on SM analysis using machine learning approaches have been published over the past few decades. In this paper, we provide a comprehensive survey of multiple applications of SM analysis using robust machine learning algorithms. Initially, we discuss a summary of machine learning algorithms that are used in SM analysis. After that, we provide a detailed survey of machine learning approaches to SM analysis. Furthermore, we summarize the challenges and benefits of machine learning usage in SM analysis. Finally, we present open issues in SM analysis for further research.
Article
In online learning, the dropout phenomenon is a relevant issue to address with practical solutions. Several data sets stimulate original and resolutive data analysis approaches, demonstrating the importance of the dropout phenomenon. This study proposes a novel approach to predicting massive open online course (MOOC) students at risk of dropout, stressing the need to consider the temporal dimension of the data log. The proposal aims to build a data-driven decision support system able to identify students at risk of dropout based on the conceptualization of such students' behavior and its evolution along the time dimension. The primary theoretical model behind the proposed method is formal concept analysis and its temporal extension (i.e., temporal concept analysis), used for analyzing timestamped data and deriving a timed lattice. The main result of the paper is a method to extract behavioral patterns of MOOC students at risk of dropout. Such patterns are defined as Time-based Behavior Rules extracted from the aforementioned timed lattice obtained through the preprocessing of MOOC platform log files. The resulting rule set can be easily integrated for implementing educational DSS, as shown in the last part of the paper. The conducted experiments reveal promising results in terms of F-score and students' monitoring time.
Article
Full-text available
The Internet has become a distribution center for ideological and cultural information and an amplifier of public opinion. Mining, analyzing, and studying hot public opinions on the Internet is an important means of fully understanding what netizens are thinking and doing. This paper analyzes the functions of a public opinion system and introduces the key technologies for predicting network public opinion.
Article
Full-text available
In this paper, we have proposed a fusion of two architectures, self-organizing map and granular self-organizing map (SOM + GSOM), for solving the microblog summarization task, in which a set of relevant tweets is extracted from the available set of tweets. SOM is used to reduce the available set of tweets to a smaller subset, and GSOM is used for extracting relevant tweets. A SOM + SOM fusion is also evaluated to illustrate the effectiveness of GSOM over SOM in the second architecture, and a SOM-only version is used to illustrate the benefit of fusion in our proposed approaches. Since similarity/dissimilarity measures play a major role in any summarization system, various measures such as word mover distance, cosine distance and Euclidean distance are also explored to compare tweets. The results obtained are evaluated on four datasets related to disaster events using ROUGE measures. Experimental results demonstrate that our best proposed approach (SOM + GSOM) has obtained 17% and 5.9% improvements in terms of ROUGE-2 and ROUGE-L scores, respectively, over the existing techniques. The results are also validated using a statistical significance t-test.
Article
In recent years, social networking sites such as Twitter have become the primary sources for real-time information on ongoing events such as political rallies, natural disasters, and so on. At the time of occurrence of natural disasters, it has been seen that relevant information collected from tweets can help in different ways. Therefore, there is a need to develop an automated microblog/tweet summarization system to automatically select relevant tweets. In this article, we employ the concept of multiobjective optimization in microblog summarization to produce good quality summaries. Different statistical quality measures, namely length, tf-idf score of the tweets, and antiredundancy, measuring different aspects of the summary, are optimized simultaneously using the search capability of a multiobjective differential evolution technique. Different types of genetic operators, including a recently developed self-organizing map (a type of neural network) based operator, are explored in the proposed framework. To measure the similarity between tweets, word mover distance is utilized, which is capable of capturing the semantic similarity between tweets. For evaluation, four benchmark data sets related to disaster events are used, and the results obtained are compared with various state-of-the-art techniques using ROUGE measures. It has been found that our algorithm improves by 62.37% and 5.65% in terms of ROUGE-2 and ROUGE-L scores, respectively, over the state-of-the-art techniques. Results are also validated using a statistical significance t-test. At the end of the article, the extension of the proposed approach to solve the multidocument summarization task is also illustrated.
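As a much-simplified stand-in for the idea of balancing informativeness against redundancy, the following greedy sketch scores hypothetical tweets by tf-idf mass minus a redundancy penalty; the paper's actual multiobjective differential evolution and genetic operators are not reproduced here:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    "flood waters rising near the old bridge, roads closed",
    "roads closed near the bridge because of flooding",
    "rescue teams deployed to the eastern district tonight",
    "power outage reported in the eastern district",
    "stay away from the old bridge, flood waters rising",
]

X = TfidfVectorizer().fit_transform(tweets)
scores = np.asarray(X.sum(axis=1)).ravel()       # informativeness: tf-idf mass
sim = cosine_similarity(X)                       # pairwise tweet similarity

def greedy_summary(k=3, redundancy_weight=0.7):
    selected = []
    while len(selected) < k:
        best, best_val = None, -np.inf
        for i in range(len(tweets)):
            if i in selected:
                continue
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            val = scores[i] - redundancy_weight * redundancy
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
    return [tweets[i] for i in selected]

for t in greedy_summary():
    print("-", t)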
Article
Due to the considerable growth of the volume of text documents on the Internet and in digital libraries, manual analysis of these documents is no longer feasible. Having efficient approaches to keyword extraction in order to retrieve the ‘key’ elements of the studied documents is now a necessity. Keyword extraction has been an active research field for many years, covering various applications in Text Mining, Information Retrieval, and Natural Language Processing, and meeting different requirements. However, it is not a unified domain of research. In spite of the existence of many approaches in the field, there is no single approach that effectively extracts keywords from different data sources. This shows the importance of having a comprehensive review, which discusses the complexity of the task and categorizes the main approaches of the field based on the features and methods of extraction that they use. This paper presents a general introduction to the field of keyword/keyphrase extraction. Unlike the existing surveys, different aspects of the problem along with the main challenges in the field are discussed. This mainly includes the unclear definition of ‘keyness’, complexities of targeting proper features for capturing desired keyness properties and selecting efficient extraction methods, and also the evaluation issues. By classifying a broad range of state-of-the-art approaches and analysing the benefits and drawbacks of different features and methods, we provide a clearer picture of them. This review is intended to help readers find their way around all the works related to keyword extraction and guide them in choosing or designing a method that is appropriate for the application they are targeting.
Article
Full-text available
The purpose of this research is to investigate the status and the evolution of the scientific studies on the effect of social networks on big data and the usage of big data for modeling the behavior of social network users. This paper presents a comprehensive review of the studies associated with big data in social media. The study uses the Scopus database as the primary search engine and covers 2000 highly cited articles over the period 2012-2019. The records are statistically analyzed and categorized in terms of different criteria. The findings show that research has grown exponentially since 2014 and the trend has continued at relatively stable rates. Based on the survey, decision support systems is the keyword with the highest density, followed by heuristics methods. Among the most cited articles, papers published by researchers in the United States have received the highest citations (7548), followed by the United Kingdom (588) and China with 543 citations. The thematic analysis shows that the subject has remained an important and well-developed research field, and for better results this research can be merged with “big data analytics” and “twitter”, which are important topics in this field but not yet well developed.
Article
There are two fundamental difficulties that are still hindering the development of microblog summarization. The first is the feature sparseness of microblogs, which restricts the performance of sub-topic detection. The second is sentence selection from sub-topics, which is based mainly on centrality approaches to measure sentence salience; the semantic features and relation features between sentences and sub-topics have not been given much attention. In order to address the two aforementioned problems, we propose a summarization method considering Paragraph Vector and semantic structure. Firstly, we construct a sentence similarity matrix that involves the contextual information of microblogs to detect sub-topics by using Paragraph Vector. Secondly, we analyze the sentences by utilizing the Chinese Sentential Semantic Model (CSM) to get semantic features; then the relation features are obtained based on the similarity matrix and the semantic features above. Finally, the most informative sentences can be selected accurately from microblogs belonging to the same sub-topics by semantic features and relation features. The experimental results show that the ROUGE-1 value is up to 53.17% with a 1.5% compression ratio. The results indicate that applying Paragraph Vector to the field of microblog summarization can effectively improve sub-topic detection. Additionally, semantic features and relation features jointly enhance the summarization results. Furthermore, CSM provides a promising tool for sentence semantic analysis.
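A minimal Paragraph Vector sketch using gensim's Doc2Vec on hypothetical posts, followed by k-means to group them into sub-topics; the Chinese Sentential Semantic Model and the relation features described above are not reproduced:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

posts = [
    "team wins the championship after extra time",
    "fans celebrate the championship in the city square",
    "heavy rain expected over the weekend",
    "weekend forecast warns of heavy rain and wind",
]
corpus = [TaggedDocument(words=p.split(), tags=[i]) for i, p in enumerate(posts)]

# train Paragraph Vector on the (tiny, illustrative) corpus
model = Doc2Vec(vector_size=32, min_count=1, epochs=60, seed=1)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# infer one vector per post, then cluster into sub-topics
vectors = [model.infer_vector(p.split()) for p in posts]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for post, label in zip(posts, labels):
    print(label, post)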
Article
Full-text available
Social media services such as Twitter generate a phenomenal volume of content for most real-world events on a daily basis. Digging through the noise and redundancy to understand the important aspects of the content is a very challenging task. We propose a search and summarization framework to extract relevant representative tweets from a time-ordered sample of tweets to generate a coherent and concise summary of an event. We introduce two topic models that take advantage of temporal correlation in the data to extract relevant tweets for summarization. The summarization framework has been evaluated using Twitter data on four real-world events. Evaluations are performed using Wikipedia articles on the events as well as using Amazon Mechanical Turk (MTurk) with human readers (MTurkers). Both experiments show that the proposed models outperform traditional LDA and lead to informative summaries.
Conference Paper
Full-text available
Grooming is the process by which pedophiles try to find children on the internet for sex-related purposes. In chat conversations they may try to establish a connection and escalate the conversation towards a physical meeting. To date, no effective methods exist for quickly analyzing the contents, evolution over time, present state and threat level of these chat conversations. In this paper we propose a novel method based on Temporal Relational Semantic Systems, the main structure in the temporal and relational version of Formal Concept Analysis. For rapidly gaining insight into the topics of chat conversations we combine a linguistic ontology for chat terms with conceptual scaling and represent the dynamics of chats by life tracks in nested line diagrams. To showcase the possibilities of our approach we used chat conversations of a private American organization which actively searches for pedophiles on the internet.
Article
Full-text available
During recent years, socially generated content has become pervasive on the World Wide Web. The enormous amount of content generated in blog sites, social networking sites such as Facebook and Myspace, and encyclopedic sites such as Wikipedia has not only empowered ordinary users of the Web but also contributed to the vastness as well as richness of the Web's contents. In this paper, we focus on a recent trend called microblogging, and in particular a site called Twitter that allows a huge number of users to contribute frequent short messages. The content of such a site is an extraordinarily large number of small textual messages, posted by millions of users, at random or in response to perceived events or situations. However, out of such random and massive disorganization of messages, trends usually emerge as a large number of users post similar messages on similar topics. These trends can be discovered using statistical analysis of the mass of posts. We have developed an algorithm that takes a trending phrase or any phrase specified by a user, collects a large number of posts containing the phrase, and provides an automatically created summary of the posts related to the term. We present examples of summaries we produce along with an initial qualitative evaluation. It is possible to get a global view of the content of the text message repository in terms of a set of short summaries of trending terms during the course of a period of time such as an hour or a day.
Conference Paper
Full-text available
In traditional text clustering methods, documents are represented as "bags of words" without considering the semantic information of each document. For instance, if two documents use different collections of core words to represent the same topic, they may be falsely assigned to different clusters due to the lack of shared core words, although the core words they use are probably synonyms or semantically associated in other forms. The most common way to solve this problem is to enrich document representation with the background knowledge in an ontology. There are two major issues for this approach: (1) the coverage of the ontology is limited, even for WordNet or Mesh, (2) using ontology terms as replacement or additional features may cause information loss, or introduce noise. In this paper, we present a novel text clustering method to address these two issues by enriching document representation with Wikipedia concept and category information. We develop two approaches, exact match and relatedness-match, to map text documents to Wikipedia concepts, and further to Wikipedia categories. Then the text documents are clustered based on a similarity metric which combines document content information, concept information as well as category information. The experimental results using the proposed clustering framework on three datasets (20-newsgroup, TDT2, and LA Times) show that clustering performance improves significantly by enriching document representation with Wikipedia concepts and categories.
Conference Paper
Full-text available
Microblogs are a tremendous repository of user-generated content about world events. However, for people trying to understand events by querying services like Twitter, a chronological log of posts makes it very difficult to get a detailed understanding of an event. In this paper, we present TwitInfo, a system for visualizing and summarizing events on Twitter. TwitInfo allows users to browse a large collection of tweets using a timeline-based display that highlights peaks of high tweet activity. A novel streaming algorithm automatically discovers these peaks and labels them meaningfully using text from the tweets. Users can drill down to subevents, and explore further via geolocation, sentiment, and popular URLs. We contribute a recall-normalized aggregate sentiment visualization to produce more honest sentiment overviews. An evaluation of the system revealed that users were able to reconstruct meaningful summaries of events in a small amount of time. An interview with a Pulitzer Prize-winning journalist suggested that the system would be especially useful for understanding a long-running event and for identifying eyewitnesses. Quantitatively, our system can identify 80-100% of manually labeled peaks, facilitating a relatively complete view of each event studied.
Conference Paper
Full-text available
Television broadcasters are beginning to combine social micro-blogging systems such as Twitter with television to create social video experiences around events. We looked at one such event, the first U.S. presidential debate in 2008, in conjunction with aggregated ratings of message sentiment from Twitter. We begin to develop an analytical methodology and visual representations that could help a journalist or public affairs person better understand the temporal dynamics of sentiment in reaction to the debate video. We demonstrate visuals and metrics that can be used to detect sentiment pulse, anomalies in that pulse, and indications of controversial topics that can be used to inform the design of visual analytic systems for social media events.
Conference Paper
Full-text available
This paper presents algorithms for summarizing microblog posts. In particular, our algorithms process collections of short posts on specific topics on the well-known site called Twitter and create short summaries from these collections of posts on a specific topic. The goal is to produce summaries that are similar to what a human would produce for the same collection of posts on a specific topic. We evaluate the summaries produced by the summarizing algorithms, compare them with human-produced summaries and obtain excellent results.
Conference Paper
Full-text available
This paper presents online topic model (OLDA), a topic model that automatically captures the thematic patterns and identifies emerging topics of text streams and their changes over time. Our approach allows the topic modeling framework, specifically the latent Dirichlet allocation (LDA) model, to work in an online fashion such that it incrementally builds an up-to-date model (mixture of topics per document and mixture of words per topic) when a new document (or a set of documents) appears. A solution based on the empirical Bayes method is proposed. The idea is to incrementally update the current model according to the information inferred from the new stream of data with no need to access previous data. The dynamics of the proposed approach also provide an efficient means to track the topics over time and detect emerging topics in real time. Our method is evaluated both qualitatively and quantitatively using benchmark datasets. In our experiments, OLDA has discovered interesting patterns by analyzing just a fraction of data at a time. Our tests also demonstrate the ability of OLDA to align the topics across epochs, with which the evolution of the topics over time is captured. OLDA is also comparable to, and sometimes better than, the original LDA in predicting the likelihood of unseen documents.
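gensim's LdaModel offers a broadly similar incremental update mechanism; the sketch below illustrates online topic updating on a hypothetical two-batch stream, not the exact OLDA algorithm of the paper:

from gensim import corpora
from gensim.models import LdaModel

batch_1 = [["election", "votes", "poll"], ["poll", "results", "election"]]
batch_2 = [["storm", "flood", "rescue"], ["flood", "warning", "storm"]]

# for simplicity the vocabulary is fixed up front rather than grown per epoch
dictionary = corpora.Dictionary(batch_1 + batch_2)
corpus_1 = [dictionary.doc2bow(doc) for doc in batch_1]
corpus_2 = [dictionary.doc2bow(doc) for doc in batch_2]

lda = LdaModel(corpus_1, id2word=dictionary, num_topics=2,
               random_state=1, passes=10)
print("after batch 1:", lda.print_topics())

lda.update(corpus_2)          # incremental update, no re-training from scratch
print("after batch 2:", lda.print_topics())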
Article
Full-text available
In this paper we face the problem of specifying and verifying security protocols where temporal aspects explicitly appear in the description. For these kinds of protocols we have designed a specification formalism, which consists of a state-transition graph for each participant of the protocol, with edges labelled by trigger/action clauses. The specification of a protocol is translated into a Timed Automaton on which standard model checking techniques can be exploited (properties to be checked can be expressed in a linear/branching untimed/timed temporal logic). Throughout the presentation we use, as a running example, a two-party non-repudiation protocol for which we show how our framework applies in the verification of the fairness property of the protocol (establishing whether there is a step of the protocol in which one of the two participants can take any advantage over the other).
Article
Full-text available
A hierarchical state machine (Hsm) is a finite state machine where a vertex can either expand to another hierarchical state machine (box) or be a basic vertex (node). Each node is labeled with atomic propositions. We study an extension of this model which allows atomic propositions to label boxes as well (Shsm). We show that Shsms can be exponentially more succinct than Hsms, and that verification is in general harder by an exponential factor. We carefully establish the computational complexity of reachability, cycle detection, and model checking against general Ltl and Ctl specifications. We also discuss some natural and interesting restrictions of the considered problems for which we can prove that Shsms can be verified as efficiently as Hsms, still preserving an exponential gap of succinctness.
Article
Full-text available
We introduce the concept of a Visual Backchannel as a novel way of following and exploring online conversations about large-scale events. Microblogging communities, such as Twitter, are increasingly used as digital backchannels for timely exchange of brief comments and impressions during political speeches, sport competitions, natural disasters, and other large events. Currently, shared updates are typically displayed in the form of a simple list, making it difficult to get an overview of the fast-paced discussions as it happens in the moment and how it evolves over time. In contrast, our Visual Backchannel design provides an evolving, interactive, and multi-faceted visual overview of large-scale ongoing conversations on Twitter. To visualize a continuously updating information stream, we include visual saliency for what is happening now and what has just happened, set in the context of the evolving conversation. As part of a fully web-based coordinated-view system we introduce Topic Streams, a temporally adjustable stacked graph visualizing topics over time, a People Spiral representing participants and their activity, and an Image Cloud encoding the popularity of event photos by size. Together with a post listing, these mutually linked views support cross-filtering along topics, participants, and time ranges. We discuss our design considerations, in particular with respect to evolving visualizations of dynamically changing data. Initial feedback indicates significant interest and suggests several unanticipated uses.
Conference Paper
Full-text available
Association rule mining is an exploratory learning task to discover some hidden dependency relationships among items in transaction data. Quantitative association rules denote association rules with both categorical and quantitative attributes. There have been several works on quantitative association rule mining such as the application of fuzzy techniques to quantitative association rule mining, the generalized association rule mining for quantitative association rules, and importance weight incorporation into association rule mining for taking into account the user's interest. This paper introduces a new method for generalized fuzzy quantitative association rule mining with importance weights. The method uses fuzzy concept hierarchies for categorical attributes and generalization hierarchies of fuzzy linguistic terms for quantitative attributes. It enables the users to flexibly perform the association rule mining by controlling the generalization levels for attributes and the importance weights for attributes
Article
We present TweetMotif, an exploratory search application for Twitter. Unlike traditional approaches to information retrieval, which present a simple list of messages, TweetMotif groups messages by frequent significant terms — a result set's subtopics — which facilitate navigation and drilldown through a faceted search interface. The topic extraction system is based on syntactic filtering, language modeling, near-duplicate detection, and set cover heuristics. We have used TweetMotif to deflate rumors, uncover scams, summarize sentiment, and track political protests in real-time. A demo of TweetMotif, plus its source code, is available at http://tweetmotif.com.
Article
Microblogging concurrently with live media events is becoming commonplace. The resulting comment stream represents a parallel, social conversational reflection on the event. Although not formally `attached' to the actual event stream itself, we demonstrate it is possible to establish a relationship between the two streams by mapping their structural properties. In this article, we examine: How do people produce and consume real-time commentary? And how does the structure of commentary and conversation change in response to moments of interest? Using a dataset of 53,712 Twitter posts, or tweets, sampled during the inauguration of Barack Obama in January 2009, we develop methods for exploring these questions. We find that short message activity reflects the structure and content of this media event. Specifically, messages directed at large audiences can serve as broadcast announcements, while variations in the level of conversation can reflect levels of interest in the media event itself. Finally, we present some implications for the design of future tools for a variety of users ranging from consumers to journalists.
Article
Based on Formal Concept Analysis, we introduce Temporal Concept Analysis as a temporal conceptual granularity theory for movements of general objects in abstract or "real" space and time such that the notions of states, situations, transitions and life tracks of objects in conceptual time systems are defined mathematically. The life track lemma is a first approach to granularity reasoning. Applications of Temporal Concept Analysis in medicine and in chemical industry are demonstrated as well as recent developments of computer programs for graphical representations of temporal systems. Basic relations between Temporal Concept Analysis and other temporal theories, namely theoretical physics, mathematical system theory, automata theory, and temporal logic are discussed.
Conference Paper
In this paper, we propose a method for short text categorization using a topic model and an integrated classifier. To enrich the representation of short texts, the Latent Dirichlet Allocation (LDA) model is used to extract latent topic information. For classification, we combine two classifiers to achieve high reliability. In particular, we train LDA models with a variable number of topics using the Wikipedia corpus as an external knowledge base, and extend labeled Web snippets with potential topics extracted by LDA. Then, the enriched representations of snippets are used to learn Maximum Entropy (MaxEnt) and support vector machine (SVM) classifiers separately. Finally, observing that the most likely prediction will appear among the top two candidates selected by the MaxEnt classifier, we develop a novel scheme: if the gap between these candidates is large enough, the predicted result is considered reliable; otherwise, the SVM classifier is integrated with the MaxEnt classifier to make a comprehensive prediction. Experimental results show that our framework is effective and can outperform the state-of-the-art techniques.
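A minimal sketch of the described decision scheme, with scikit-learn's LogisticRegression standing in for the MaxEnt classifier and LinearSVC for the SVM, on hypothetical snippets; the LDA-based topic enrichment is omitted:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

train_texts = ["cheap flights and hotel deals", "book your holiday trip now",
               "new graphics card benchmark results", "laptop cpu performance review"]
train_labels = ["travel", "travel", "tech", "tech"]

vec = TfidfVectorizer().fit(train_texts)
X = vec.transform(train_texts)
maxent = LogisticRegression(max_iter=1000).fit(X, train_labels)   # MaxEnt stand-in
svm = LinearSVC().fit(X, train_labels)

def predict(text, gap_threshold=0.2):
    x = vec.transform([text])
    probs = maxent.predict_proba(x)[0]
    top2 = np.sort(probs)[-2:]                  # two best MaxEnt candidates
    if top2[1] - top2[0] >= gap_threshold:      # large gap: trust MaxEnt
        return maxent.classes_[np.argmax(probs)]
    return svm.predict(x)[0]                    # otherwise fall back to the SVM

print(predict("hotel booking discount"))
print(predict("cpu and flights"))               # ambiguous case: SVM decides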
Article
Owing to the sheer volume of text generated by a microblog site like Twitter, it is often difficult to fully understand what is being said about various topics. This paper presents algorithms for summarizing microblog documents. Initially, we present algorithms that produce single-document summaries but later extend them to produce summaries containing multiple documents. We evaluate the generated summaries by comparing them to both manually produced summaries and, for the multiple-post summaries, to the summarization results of some of the leading traditional summarization systems.
Article
As an information delivering platform, Twitter collects millions of tweets every day. However, some users, especially new users, often find it difficult to understand trending topics in Twitter when confronted with the overwhelming and unorganized tweets. Existing work has attempted to provide a short snippet to explain a topic, but this only provides limited benefits and cannot satisfy the users' expectations. In this paper, we propose a new summarization task, namely sequential summarization, which aims to provide a series of chronologically ordered short sub-summaries for a trending topic in order to provide a complete story about the development of the topic while retaining the order of information presentation. Different from the traditional summarization task, the numbers of sub-summaries for different topics are not fixed. Two approaches, i.e., stream-based and semantic-based approaches, are developed to detect the important subtopics within a trending topic. Then a short sub-summary is generated for each subtopic. In addition, we propose three new measures to evaluate the position-aware coverage, sequential novelty and sequence correlation of the system-generated summaries. The experimental results based on the proposed evaluation criteria have demonstrated the effectiveness of the proposed approaches.
Conference Paper
User-contributed content is creating a surge on the Internet. A list of "buzzing topics" can effectively monitor the surge and lead people to their topics of interest. Yet a topic phrase alone, such as "SXSW", can rarely present the information clearly. In this paper, we propose to explore a variety of text sources for summarizing the Twitter topics, including the tweets, normalized tweets via a dedicated tweet normalization system, web contents linked from the tweets, as well as integration of different text sources. We employ the concept-based optimization framework for topic summarization, and conduct both automatic and human evaluation regarding the summary quality. Performance differences are observed for different input sources and types of topics. We also provide a comprehensive analysis regarding the task challenges.
Conference Paper
With the explosive growth of microblogging services, short-text messages (also known as tweets) are being created and shared at an unprecedented rate. Tweets in their raw form can be incredibly informative, but also overwhelming. For both end-users and data analysts it is a nightmare to plow through millions of tweets which contain enormous noise and redundancy. In this paper, we study continuous tweet summarization as a solution to address this problem. While traditional document summarization methods focus on static and small-scale data, we aim to deal with dynamic, quickly arriving, and large-scale tweet streams. We propose a novel prototype called Sumblr (SUMmarization By stream cLusteRing) for tweet streams. We first propose an online tweet stream clustering algorithm to cluster tweets and maintain distilled statistics called Tweet Cluster Vectors. Then we develop a TCV-Rank summarization technique for generating online summaries and historical summaries of arbitrary time durations. Finally, we describe a topic evolvement detection method, which consumes online and historical summaries to produce timelines automatically from tweet streams. Our experiments on large-scale real tweets demonstrate the efficiency and effectiveness of our approach.
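A heavily reduced sketch of online tweet stream clustering in the spirit of the approach described above: each incoming tweet joins the closest centroid or opens a new cluster, with centroids kept as running sums (a stand-in for Tweet Cluster Vectors, not the paper's algorithm); the data and threshold are hypothetical:

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False, norm="l2")

class StreamClusters:
    """Assign each incoming tweet to the closest centroid, or open a new
    cluster when nothing is similar enough; centroids are running sums."""
    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.sums, self.counts, self.members = [], [], []

    def add(self, tweet):
        v = vectorizer.transform([tweet]).toarray()[0]
        best, best_sim = None, self.threshold
        for i, (s, n) in enumerate(zip(self.sums, self.counts)):
            sim = cosine_similarity([v], [s / n])[0][0]
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:                  # nothing similar enough: new cluster
            self.sums.append(v)
            self.counts.append(1)
            self.members.append([tweet])
        else:                             # update the running centroid
            self.sums[best] = self.sums[best] + v
            self.counts[best] += 1
            self.members[best].append(tweet)

stream = ["earthquake felt in the city center",
          "strong earthquake shakes the city center tonight",
          "new phone model announced at the expo",
          "expo crowd reacts to the new phone model"]
clusters = StreamClusters()
for t in stream:
    clusters.add(t)
for i, tweets in enumerate(clusters.members):
    print(f"cluster {i}:", tweets)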
Conference Paper
Ontologies have been extensively applied in various fields, such as artificial intelligence, information extraction, and information retrieval. In this paper we describe a new approach for automatically learning terminological ontologies. The method takes the topics generated by a generative topic model as concepts and builds subsumption relationships between them to learn an ontology without requiring a seed ontology. It introduces the CosTMI measure to compute semantic similarity between topics, organizing them into a hierarchical structure to form a new ontology. We evaluate our method on the GENIA corpus, a real-world collection of biomedical literature, and the experimental results demonstrate the validity and efficiency of the proposed method.
Article
With the widespread applications of electronic learning (e-Learning) technologies to education at all levels, increasing number of online educational resources and messages are generated from the corresponding e-Learning environments. Nevertheless, it is quite difficult, if not totally impossible, for instructors to read through and analyze the online messages to predict the progress of their students on the fly. The main contribution of this paper is the illustration of a novel concept map generation mechanism which is underpinned by a fuzzy domain ontology extraction algorithm. The proposed mechanism can automatically construct concept maps based on the messages posted to online discussion forums. By browsing the concept maps, instructors can quickly identify the progress of their students and adjust the pedagogical sequence on the fly. Our initial experimental results reveal that the accuracy and the quality of the automatically generated concept maps are promising. Our research work opens the door to the development and application of intelligent software tools to enhance e-Learning.
Article
A fuzzy set is a class of objects with a continuum of grades of membership. Such a set is characterized by a membership (characteristic) function which assigns to each object a grade of membership ranging between zero and one. The notions of inclusion, union, intersection, complement, relation, convexity, etc., are extended to such sets, and various properties of these notions in the context of fuzzy sets are established. In particular, a separation theorem for convex fuzzy sets is proved without requiring that the fuzzy sets be disjoint.
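In symbols, the standard definitions recalled in this abstract read:

\[
A = \{\, (x, \mu_A(x)) \mid x \in X \,\}, \qquad \mu_A : X \to [0,1],
\]
\[
\mu_{A \cup B}(x) = \max\bigl(\mu_A(x), \mu_B(x)\bigr), \qquad
\mu_{A \cap B}(x) = \min\bigl(\mu_A(x), \mu_B(x)\bigr), \qquad
\mu_{\bar{A}}(x) = 1 - \mu_A(x).
\]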
Conference Paper
The detection of new information in a document stream is an important component of many potential applications. In this work, a new novelty detection approach based on the identification of sentence-level information patterns is proposed. First, the information-pattern concept for novelty detection is presented, with emphasis on new information patterns for general topics (queries) that cannot simply be turned into specific questions whose answers are specific named entities (NEs). Then we elaborate a thorough analysis of sentence-level information patterns on data from the TREC novelty tracks, including sentence lengths, named entities, and sentence-level opinion patterns. This analysis provides guidelines for applying those patterns in novelty detection, particularly for general topics. Finally, a unified pattern-based approach to novelty detection is presented for both general and specific topics, with the new method for dealing with general topics as the focus. Experimental results show that the proposed approach significantly improves the performance of novelty detection for general topics as well as the overall performance for all topics from the 2002-2004 TREC novelty tracks.
Conference Paper
We propose a system for detecting local events in the real world using geolocation information from microblog documents. A local event happens when people with a common purpose gather at the same time and place. To detect such an event, we identify a group of Twitter documents describing the same theme that were generated within a short time and a small geographic area. Timestamps and geotags are useful for finding such documents, but only 0.7% of documents are geotagged, which is not sufficient for this purpose. Therefore, we propose an automatic geotagging method that identifies the location of non-geotagged documents. Our geotagging method successfully increased the number of geographic groups by about 115 times. For each group of documents, we extract co-occurring terms to identify its theme and determine whether it is about an event. We subjectively evaluated the precision of our detected local events and found that it was 25.5%. These results demonstrate that our system can detect local events that are difficult to identify using existing event detection methods. A user can interactively specify the size of a desired event by manipulating the parameters of date, area size, and the minimum number of Twitter users associated with the location. Our system allows users to enjoy the novel experience of finding a local event happening near their current location in real time.
Conference Paper
Automatic taxonomy generation deals with organizing text documents in terms of an unknown labeled hierarchy. The main issues here are (i) how to identify documents that have similar content, (ii) how to discover the hierarchical structure of the topics and subtopics, and (iii) how to find appropriate labels for each of the topics and subtopics. In this paper, we review several approaches to automatic taxonomy generation to provide an insight into the issues involved. We also describe how fuzzy hierarchies can overcome some of the problems associated with traditional crisp taxonomies.
Article
Nowadays, Web 2.0 focuses on user-generated content, data sharing and collaboration activities. Formats like Really Simple Syndication (RSS) provide structured Web information, display changes in summary form and keep users updated about news headlines of interest. This trend has also affected the e-learning domain, where RSS feeds support dynamic learning activities, enabling learners and teachers to access new blog posts, keep track of newly shared media, and consult Learning Objects which meet their needs. This paper presents an approach to enrich personalized e-learning experiences with user-generated content, through contextualized consumption of RSS feeds. The synergic exploitation of Knowledge Modeling and Formal Concept Analysis techniques enables the design and development of a system that supports learners in their learning activities by collecting, conceptualizing, classifying and providing updated information on specific topics coming from relevant information sources. An agent-based layer supervises the extraction and filtering of RSS feeds whose topics cover a specific educational domain.
Article
This paper introduces the use of Wikipedia as a resource for automatic keyword extraction and word sense disambiguation, and shows how this online encyclopedia can be used to achieve state-of-the-art results on both these tasks. The paper also shows how the two methods can be combined into a system able to automatically enrich a text with links to encyclopedic knowledge. Given an input document, the system identifies the important concepts in the text and automatically links these concepts to the corresponding Wikipedia pages, providing the users with a quick way of accessing additional information. Evaluations of the system show that the automatic annotations are reliable and hardly distinguishable from manual annotations. Wikipedia contributors perform such annotations by hand following a Wikipedia "manual of style," which gives guidelines concerning the selection of important concepts in a text, as well as the assignment of links to appropriate related articles. For instance, Figure 1 shows an example of a Wikipedia page, including the definition for one of the meanings of the word "plant."
Article
Information retrieval is an important task for electronic commerce. Ontology-based semantic retrieval is a hotspot of current research. In order to achieve fuzzy semantic retrieval, this paper applies a fuzzy ontology framework to an information retrieval system in e-commerce. The framework includes three parts: concepts, properties of concepts and values of properties, in which a property's value can be either a standard data type or a linguistic value of a fuzzy concept. The semantic query expansions are constructed by order relation, equivalence relation, inclusion relation, reversion relation and complement relation between fuzzy concepts defined in linguistic variable ontologies with the Resource Description Framework (RDF). The application to retrieving customer, product and supplier information shows that the framework can overcome the limitations of other fuzzy ontology models, and this research facilitates the semantic retrieval of information through fuzzy concepts on the Semantic Web.
Conference Paper
This paper describes an algorithm for building fuzzy hierarchies. These are hierarchies where the elements can have fuzzy membership to the nodes. The paper presents an approach that mainly follows a bottom-up strategy, and describes the functions needed to operate with fuzzy variables. An example of the application of the approach is also presented
Article
In this paper, we extend the work of Kraft et al. to present a new method for fuzzy information retrieval based on fuzzy hierarchical clustering and fuzzy inference techniques. First, we present a fuzzy agglomerative hierarchical clustering algorithm for clustering documents and to get the document cluster centers of document clusters. Then, we present a method to construct fuzzy logic rules based on the document clusters and their document cluster centers. Finally, we apply the constructed fuzzy logic rules to modify the user's query for query expansion and to guide the information retrieval system to retrieve documents relevant to the user's request. The fuzzy logic rules can represent three kinds of fuzzy relationships (i.e., fuzzy positive association relationship, fuzzy specialization relationship and fuzzy generalization relationship) between index terms. The proposed fuzzy information retrieval method is more flexible and more intelligent than the existing methods due to the fact that it can expand users' queries for fuzzy information retrieval in a more effective manner.
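A toy sketch of threshold-based query expansion over a hypothetical fuzzy term-term relation, illustrating the general idea; the clustering-derived fuzzy logic rules and the three relationship types of the paper are not reproduced:

# hypothetical fuzzy relation: degree to which one index term implies another
relation = {
    ("laptop", "notebook"): 0.9,     # positive association
    ("laptop", "computer"): 0.8,     # generalization
    ("computer", "laptop"): 0.5,     # specialization
    ("laptop", "charger"): 0.4,
}

def expand_query(terms, alpha=0.6):
    """Add every term related to a query term with degree >= alpha,
    keeping the degree as the weight of the expanded term."""
    expanded = {t: 1.0 for t in terms}
    for (src, dst), degree in relation.items():
        if src in terms and degree >= alpha:
            expanded[dst] = max(expanded.get(dst, 0.0), degree)
    return expanded

print(expand_query(["laptop"]))   # {'laptop': 1.0, 'notebook': 0.9, 'computer': 0.8}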
Conference Paper
Formal Concept Analysis is an unsupervised learning technique for conceptual clustering. We introduce the notion of iceberg concept lattices and show their use in Knowledge Discovery in Databases (KDD). Iceberg lattices are designed for analyzing very large databases. In particular they serve as a condensed representation of frequent patterns as known from association rule mining. In order to show the interplay between Formal Concept Analysis and association rule mining, we discuss the algorithm Titanic. We show that iceberg concept lattices are a starting point for computing condensed sets of association rules without loss of information, and are a visualization method for the resulting rules.
Jasmine: A real-time local-event detection system based on geolocation information propagated to microblogs
K. Watanabe, M. Ochi, M. Okabe, R. Onai, in: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, ACM, New York, NY, USA, 2011, pp. 2541-2544. doi:10.1145/2063576.2064014.
Enhancing query-oriented summarization based on sentence wikification
Y. Miao, C. Li, in: Workshop of the 33rd Annual International, 2010, p. 32.
Twitinfo: Aggregating and visualizing microblogs for event exploration
A. Marcus, M. S. Bernstein, O. Badar, D. R. Karger, S. Madden, R. C. Miller, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '11, ACM, New York, NY, USA, 2011, pp. 227-236. doi:10.1145/1978942.1978975.
Sumblr: Continuous summarization of evolving tweet streams
L. Shou, Z. Wang, K. Chen, G. Chen, in: Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '13, ACM, New York, NY, USA, 2013, pp. 533-542. doi:10.1145/2484028.2484045.
A visual backchannel for large-scale events
M. Dork, D. Gruen, C. Williamson, S. Carpendale, IEEE Transactions on Visualization and Computer Graphics 16 (6) (2010) 1129-1138. doi:10.1109/TVCG.2010.129.
Towards a temporal extension of formal concept analysis
R. Neouchi, A. Tawfik, R. Frost, in: E. Stroulia, S. Matwin (Eds.), Advances in Artificial Intelligence, Vol. 2056 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2001, pp. 335-344. doi:10.1007/3-540-45153-6_33.
Soft computing for information retrieval on the web
G. Bordogna, M. Pagani, G. Pasi.