Fig. 1: EPIC Analyze Software Architecture

Source publication
Conference Paper
Full-text available
Web-based data analysis environments are powerful platforms for exploring large data sets. To ensure that these environments meet the needs of analysts, a human-centered perspective is needed. Interfaces to these platforms should provide flexible search, support user-generated content, and enable collaboration. We report on our efforts to design an...

Contexts in source publication

Context 1
... Analyze [1] is a data analysis platform that builds on top of our previous work on EPIC Collect [2,18], a system designed for reliable and scalable social media collection. EPIC Analyze extends EPIC Collect with an architecture designed to support social media analytics (see Fig. 1). These systems support an analysis workflow that starts when an event of interest has been detected. Project EPIC analysts monitor Twitter for keywords of interest and use the EPIC Event Editor (a simple web application) to associate those keywords with a new event. EPIC Collect detects the presence of this new event and submits its ...
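The keyword-to-event association step in this workflow is easy to picture in code. Below is a minimal, hypothetical sketch; the `Event` class and `matches` helper are illustrative stand-ins, not part of the actual EPIC codebase.

```python
# Hypothetical sketch of the keyword-to-event workflow described above.
# All names here are illustrative, not from the EPIC implementation.
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str
    keywords: set[str] = field(default_factory=set)

def matches(event: Event, tweet_text: str) -> bool:
    """Return True if any of the event's keywords appear in the tweet."""
    text = tweet_text.lower()
    return any(kw.lower() in text for kw in event.keywords)

# An analyst associates keywords with a new event via the Event Editor;
# the collector then filters the Twitter stream against those keywords.
event = Event("2013_colorado_floods", {"boulderflood", "coflood"})
print(matches(event, "Roads closed near Boulder #boulderflood"))  # True
```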
Context 2
... The architecture for EPIC Analyze shown in Fig. 1 builds on DataStax Enterprise (http://www.datastax.com/) and its integrated versions of Solr, Pig, and Hadoop. Each of these components can be used to help index, search, or process our large Twitter data sets. We make use of PostgreSQL to store comments and annotations made by analysts while working with EPIC Analyze. EPIC Analyze is itself implemented as a Ruby on ...
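The role PostgreSQL plays here, storing analyst comments and annotations alongside the Cassandra-backed tweet store, can be sketched as follows. The table layout and connection string are assumptions for illustration; the real EPIC Analyze schema is not shown in this excerpt.

```python
# A minimal sketch of storing analyst comments/annotations in PostgreSQL,
# per the architecture described above. Schema and DSN are assumptions.
import psycopg2

conn = psycopg2.connect("dbname=epic_analyze user=analyst")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS annotations (
            id         SERIAL PRIMARY KEY,
            tweet_id   BIGINT NOT NULL,   -- id of the annotated tweet
            analyst    TEXT   NOT NULL,
            body       TEXT   NOT NULL,   -- free-form comment text
            created_at TIMESTAMPTZ DEFAULT now()
        )
    """)
    cur.execute(
        "INSERT INTO annotations (tweet_id, analyst, body) VALUES (%s, %s, %s)",
        (1234567890, "jdoe", "Eyewitness report; flag for coding."),
    )
```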

Citations

... This NoSQL database is focused on writes and provides high throughput to handle all incoming tweets. EPIC Analyze [3,7] is a web-based system that makes use of a variety of software frameworks (e.g., Hadoop, Solr, and Redis) to provide various analysis and annotation services for analysts on the large data sets collected by EPIC Collect. In addition, Project EPIC maintained one machine, known as EPIC Analytics, with a large amount of physical memory to allow analysts to run memory-intensive processes over the collected data. ...
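The write-oriented design this citation describes can be sketched with the Python Cassandra driver. The keyspace, table, and column names below are assumptions, not the actual EPIC Collect schema.

```python
# Sketch of a high-throughput ingest path into Cassandra, matching the
# citation's description of write-optimized tweet storage. All names are
# assumptions for illustration.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("epic_collect")  # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO tweets (event_name, tweet_id, body) VALUES (?, ?, ?)"
)

def store_tweet(event_name: str, tweet_id: int, body: str) -> None:
    # Cassandra treats each INSERT as an upsert and acknowledges writes
    # without a read-before-write, which keeps the ingest path fast even
    # under bursty streams.
    session.execute(insert, (event_name, tweet_id, body))
```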
... To explore events, an interface was created similar to the one described for EPIC Analyze [7]: a list of tweets with a timeline visualization above it for time slicing the data set. To allow more technical analysts to perform deeper analysis, the system also points to the internal BigQuery table in Google Cloud Console. ...
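For the deeper-analysis path that points analysts at the internal BigQuery table, a time-sliced query might look like the following sketch. The project, dataset, table, and column names are hypothetical.

```python
# Hypothetical sketch of the "time slicing" interaction above, expressed as
# a BigQuery query for the more technical analysts the citation mentions.
from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials are configured
sql = """
    SELECT TIMESTAMP_TRUNC(created_at, HOUR) AS bucket, COUNT(*) AS n
    FROM `my-project.crisis.tweets`           -- hypothetical table
    WHERE created_at BETWEEN @start AND @end  -- the selected time slice
    GROUP BY bucket
    ORDER BY bucket
"""
job = client.query(sql, job_config=bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter(
            "start", "TIMESTAMP", datetime(2017, 10, 1, tzinfo=timezone.utc)),
        bigquery.ScalarQueryParameter(
            "end", "TIMESTAMP", datetime(2017, 10, 2, tzinfo=timezone.utc)),
    ]
))
for row in job.result():
    print(row.bucket, row.n)
```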
... MongoDB was proposed in [5], even though it was acknowledged that it had limitations and that its queries were slow to resolve. EPIC Analyze [3,7] addressed some of these issues by switching to an integration between Cassandra and Solr. ...
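The query-speed contrast drawn here comes down to indexed search: Solr maintains an inverted index over the tweets stored in Cassandra. A minimal sketch with the pysolr client, assuming a hypothetical core and field names:

```python
# Sketch of the kind of indexed search the Cassandra+Solr integration
# enables, in contrast to the slow MongoDB queries mentioned above.
# Core name and field names are assumptions.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/tweets")  # hypothetical core
results = solr.search("text:flood AND event:2013_colorado_floods", rows=10)
for doc in results:
    print(doc["id"], doc.get("text", ""))
```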
Preprint
Full-text available
Social media analysis of disaster events is a critical task in crisis informatics research. It involves analyzing social media data generated during natural disasters, crisis events, or other mass convergence events. Due to the large data sets generated during these events, large-scale software infrastructures need to be designed to analyze the data in a timely manner. Creating such infrastructures brings the need to maintain them, and this becomes more difficult as they grow larger and older. Maintenance costs are high since queries must be handled quickly, which requires large amounts of computational resources to be available on demand 24 hours a day, seven days a week. In this thesis, I describe an alternative approach to designing a software infrastructure for analyzing unstructured data on the cloud while providing fast queries and the reliability needed for crisis informatics research. Additionally, I discuss a new approach for more reliable Twitter stream collection using container-orchestrated systems. Finally, I compare the reliability, scalability, and extensibility of this new infrastructure and its prototype with those of existing crisis informatics software infrastructures.
... Usability is a notable quality attribute since it determines how easily users can operate and control software. Interface design for data analysis environments is critical to facilitating analysis tasks for users without a strong programming background [51,52]. For instance, the usability of TwitInfo [32] was achieved by designing a user-friendly interface based on gathered user feedback. ...
Article
Full-text available
Developing software systems that meet user-demanded functionality is critical. Achieving design goals means identifying a proper set of quality attributes and implementing each one so that the complete set is reflected in the system. This study presents popular quality attributes of crisis software systems by conducting a literature review. Each crisis software system is studied by concentrating on the crisis management phases in which the system is used, its design purposes, and its data processing style. The findings of this research shed light on the crisis software development process by presenting a quality-attribute-oriented perspective, addressing design challenges, and recommending remedies that developers can use to handle those challenges.
... In contrast, Project EPIC, launched in 2009 and supported by a US National Science Foundation grant, is a multi-disciplinary effort involving several universities and languages with the goal of utilizing behavioral and technical knowledge of computer mediated communication for better crisis study and emergency response. Since its founding, Project EPIC has led to several advances in the crisis informatics space; see for example [8]-[12]. The work presented in this article is intended to be compatible with these efforts. ...
... To 'sync' the source and target domains, we consider a simple, but empirically effective, approach. Rather than use just the labeled target domain data for training the three linear regression models, we combine the labeled training data from both the source and target domains, but the target training data is up-sampled to allow its properties to emerge more concretely in the training. The up-sampling margin is a parameter in Algorithm 1; in practice, a factor of 6 (meaning the target labeled dataset is up-sampled by 6x) has been found to work well. ...
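As a concrete illustration of the up-sampling step quoted above, the sketch below repeats the scarce target-domain labels 6x before pooling them with source data. The data is synthetic, and a single regressor stands in for the three models in the paper's ensemble.

```python
# Minimal sketch of the up-sampling step described above: scarce target
# labels are repeated 6x before being pooled with source-domain data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_src, y_src = rng.normal(size=(500, 32)), rng.normal(size=500)  # source domain
X_tgt, y_tgt = rng.normal(size=(40, 32)), rng.normal(size=40)    # target domain

UPSAMPLE = 6  # the margin reported to work well in practice
X_train = np.vstack([X_src, np.tile(X_tgt, (UPSAMPLE, 1))])
y_train = np.concatenate([y_src, np.tile(y_tgt, UPSAMPLE)])

model = LinearRegression().fit(X_train, y_train)
```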
Preprint
Full-text available
Humanitarian disasters have been on the rise in recent years due to the effects of climate change and socio-political situations such as the refugee crisis. Technology can be used to best mobilize resources such as food and water in the event of a natural disaster, by semi-automatically flagging tweets and short messages as indicating an urgent need. The problem is challenging not just because of the sparseness of data in the immediate aftermath of a disaster, but because of the varying characteristics of disasters in developing countries (making it difficult to train just one system) and the noise and quirks in social media. In this paper, we present a robust, low-supervision social media urgency system that adapts to arbitrary crises by leveraging both labeled and unlabeled data in an ensemble setting. The system is also able to adapt to new crises where an unlabeled background corpus may not be available yet by utilizing a simple and effective transfer learning methodology. Experimentally, our transfer learning and low-supervision approaches are found to outperform viable baselines with high significance on myriad disaster datasets.
... In contrast, Project EPIC, launched in 2009 and supported by a US National Science Foundation grant, is a multi-disciplinary effort involving several universities and languages with the goal of utilizing behavioral and technical knowledge of computer mediated communication for better crisis study and emergency response. Since its founding, Project EPIC has led to several advances in the crisis informatics space; see for example [12][13][14][15][16]. The work presented in this article is intended to be compatible with these efforts. ...
Article
Full-text available
Due to instant availability of data on social media platforms like Twitter, and advances in machine learning and data management technology, real-time crisis informatics has emerged as a prolific research area in the last decade. Although several benchmarks are now available, especially on portals like CrisisLex, an important, practical problem that has not been addressed thus far is the rapid acquisition, benchmarking and visual exploration of data from free, publicly available streams like the Twitter API in the immediate aftermath of a crisis. In this paper, we present such a pipeline for facilitating immediate post-crisis data collection, curation and relevance filtering from the Twitter API. The pipeline is minimally supervised, alleviating the need for feature engineering by including a judicious mix of data preprocessing and fast text embeddings, along with an active learning framework. We illustrate the utility of the pipeline by describing a recent case study wherein it was used to collect and analyze millions of tweets in the immediate aftermath of the Las Vegas shootings in 2017.
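Active learning frameworks like the one this abstract mentions commonly work by uncertainty sampling: train on a small seed set, then route the least-confident items to a human for labeling. The sketch below illustrates that loop under assumed names, with random vectors standing in for the fast text embeddings; it is a generic illustration, not the paper's exact algorithm.

```python
# Hedged sketch of an uncertainty-sampling active-learning loop. Random
# vectors stand in for text embeddings; all names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(model, X_pool, k=10):
    """Pick the k items whose predicted relevance probability is nearest 0.5."""
    proba = model.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(proba - 0.5))[:k]

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 64))       # stand-in for tweet embeddings
y_true = (X_pool[:, 0] > 0).astype(int)    # the "oracle": a human annotator

labeled = list(range(20))                  # small labeled seed set
for _ in range(5):                         # a few labeling rounds
    model = LogisticRegression().fit(X_pool[labeled], y_true[labeled])
    queried = uncertainty_sampling(model, X_pool)
    labeled = sorted(set(labeled) | {int(i) for i in queried})
```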
... The platform is designed by focusing on availability, accessibility, and performance. In [31], a user interface design for data analysis environments is presented with a focus on usability, scalability, reliability, and efficiency. In [32], the CyberGIS framework is designed to integrate multi-sourced data, focusing on scalability and performance while providing real-time event tracing and mapping. ...
Conference Paper
Full-text available
In today’s digital world, we are exposed to tremendous amounts of data generated by numerous sources, and the developers of data-intensive systems are confronted with the challenges of collecting, analyzing, and storing large amounts of data. Dealing with those challenges and designing software systems that provide the demanded set of quality attributes requires engaging with sophisticated approaches, developing clever techniques, and carefully making use of cutting-edge technologies. In this paper, first, an outline of crisis informatics research and crisis management phases is introduced; next, an overview of quality attributes for data-intensive systems is presented; last, a classification of frequently demanded quality attributes for crisis data-intensive systems is provided, taking into account crisis management phases and the type of data analytics performed.
... In this paper, we focus exclusively on the work we performed to enable real-time analysis of streaming social media data during times of crisis. There are, of course, additional challenges beyond the ones mentioned above; for instance, data intensive software systems require well-designed user interfaces to facilitate access to large data sets and to allow users to search, filter, sort, query, and analyze that data [3,6,9]. While we encountered these challenges when creating the IDCAP, we do not discuss them here. ...
... Finding the right data model for a given problem domain is critical to achieve fast and efficient queries [6,9]. The storage layer of the IDCAP contains the Event_Tweets and Event_Information column families; these column families are discussed in detail in [7]. ...
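Although the actual Event_Tweets and Event_Information schemas are detailed in [7], the general shape of an event-partitioned column family can be sketched in CQL. Every name and column below is an assumption for illustration, not the IDCAP's actual data model.

```python
# Hypothetical CQL sketch of an event-partitioned column family in the
# spirit of Event_Tweets; the real schema is described in the cited paper.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("idcap")  # hypothetical keyspace
session.execute("""
    CREATE TABLE IF NOT EXISTS event_tweets (
        event_name TEXT,    -- partition key: one partition per event
        tweet_id   BIGINT,  -- clustering key: ordered within the event
        body       TEXT,
        PRIMARY KEY (event_name, tweet_id)
    )
""")
```

Partitioning on the event name keeps all of an event's tweets together, which is what makes event-scoped queries fast; production schemas typically add a bucketing column so that partitions for very large events do not grow unboundedly.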
Conference Paper
Full-text available
Real-time data collection and analytics is a desirable but challenging feature to provide in data-intensive software systems. Providing highly concurrent and efficient real-time analytics on streaming data at interactive speeds requires a well-designed software architecture that makes use of a carefully selected set of software frameworks. In this paper, we report on the design and implementation of the Incremental Data Collection & Analytics Platform (IDCAP). The IDCAP provides incremental collection and real-time indexing of social media data; support for real-time analytics at interactive speeds; highly concurrent batch data processing supported by a novel data model; and a front-end web client that allows an analyst to manage IDCAP resources, to monitor incoming data in real-time, and to perform incremental queries on top of large Twitter datasets.
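One common way to deliver incremental analytics at interactive speeds, consistent with the Redis usage mentioned in the citations above, is to update counters as each tweet arrives so that dashboard reads are O(1). The key layout below is hypothetical, not the IDCAP's actual design.

```python
# Sketch of incremental, per-keyword counters updated at ingest time, so
# that interactive queries reduce to cheap hash reads. Key names are
# assumptions for illustration.
import redis

r = redis.Redis()

def index_tweet(event: str, minute_bucket: str, keywords: list[str]) -> None:
    pipe = r.pipeline()
    for kw in keywords:
        # e.g. counts:2013_colorado_floods:flood -> {"2013-09-12T20:31": 17}
        pipe.hincrby(f"counts:{event}:{kw}", minute_bucket, 1)
    pipe.execute()
```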