Figure 1. Screenshot from Barack Obama’s Twitter profile page.

Source publication
Article
The movements of ideas and content between locations and languages are unquestionably crucial concerns to researchers of the information age, and Twitter has emerged as a central, global platform on which hundreds of millions of people share knowledge and information. A variety of research has attempted to harvest locational and linguistic metadata...

Context in source publication

Context 1
... services such as Twitter allow researchers, marketers, activists and governments unprecedented access to digital trails of data as users share information and communicate online. Patterns of information exchange on platforms that rely on user-generated content have recently been used in scholarly research about community (Gruzd, Wellman, and Takhteyev 2011), information diffusion (Romero, Meeder, and Kleinberg 2011), politics (Bruns and Burgess 2011), religion (Shelton, Zook, and Graham 2013), crisis response (Zook et al. 2010; Palen et al. 2011), and many other topics. Such data are also important to governments and marketers seeking to understand trends and patterns ranging from customer/citizen feedback to the mapping of health pandemics (Graham and Zook 2011). Twitter in particular, with its large and international user base (there are now over 350 million users on the platform), has been the source of much scholarly research. Content passed through Twitter remains decontextualized, however, unless we find ways to reattach it to geography. In other words, we don’t just want to know what is said, but also where it is said and to whom. As such, the attributes of language and location are crucial for understanding the geographies of online flows of information and the ways that they might reveal underlying economic, social, political and environmental trends and patterns.

Yet both language and location are challenging to deduce from the short messages that pass through Twitter, and no well-accepted methodology for their extraction and analysis has been articulated. This point is especially salient because of the increasing number of studies, journalistic accounts, and real-world applications that rely on harvested locational and language data from Twitter. Therefore, in order to provide a useful starting point for future research on Twitter (and indeed other micro-blogging platforms), this paper compares several approaches to working with geographic information in Twitter in order to better understand the strengths and limitations of each.

The short size of posts (140 characters on Twitter) presents a challenge to accurate language identification because most language identification algorithms are trained on larger documents (Carter, Tsagkias, and Weerkamp 2011). In addition, the style of writing on Twitter, with its abbreviations and acronyms, complicates language classification. In many instances, researchers have simply relied on the user interface language of a user’s account or used an off-the-shelf language detection package without consideration of its suitability for short, informal text. The disagreement of several studies on the most used languages in Twitter (Honeycutt and Herring 2009; Semiocast 2010; Hong, Convertino, and Chi 2011; Takhteyev, Gruzd, and Wellman 2011) highlights the difficulty of language detection. All four studies agree that English is the most used language, but give percentages ranging from 50 percent (Semiocast 2010) to 72.5 percent (Takhteyev, Gruzd, and Wellman 2011). The purpose of our work is not to study the prominence of different languages on the platform, but rather to highlight important methodological issues related to language identification so that future research can more critically engage in geolinguistic analyses. Accurately determining the location of messages sent through Twitter is also a significant challenge.
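To see the language identification problem concretely, the minimal sketch below runs an off-the-shelf detector over tweet-like strings. This is not the paper’s method: the third-party langdetect package and the sample strings are illustrative assumptions only.

```python
# A minimal sketch of why short, informal tweets trip up off-the-shelf
# language identifiers. Assumes the third-party `langdetect` package
# (pip install langdetect); any similar detector would do.
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is probabilistic; fix the seed for repeatability

samples = [
    "Just landed in Paris, c'est magnifique!!",  # code-switched English/French
    "lol omg brb",                               # abbreviations, no dictionary words
    "@user http://t.co/abc123 #news",            # mentions/links only
]

for text in samples:
    try:
        print(repr(text), "->", detect(text))
    except LangDetectException:
        # Raised when the detector finds no usable linguistic features
        print(repr(text), "-> no features detected")
```

On text this short the detector’s output is unstable and often wrong, which is exactly the failure mode the studies above disagree over.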
The most apparent method is to consider the profile information that is directly provided by a user (e.g. the text “Washington, DC” in Figure 1) in response to an account set-up question: “Where in the world are you?” However, this question, which allows users to input any text string to describe their location (referred to in this paper as ‘profile location’), is often hard to geolocate correctly (the open-ended text could just as easily say “Edinburgh, Scotland,” “Barad-dûr, Mordor, Middle-earth,” or simply “here”). High error rates, missing data and non-standardized text in profile locations have forced some researchers wishing to employ this geographic data to use smaller samples and labor-intensive manual coding of profile locations (e.g. Takhteyev, Gruzd, and Wellman 2011).

An alternate approach that some researchers have adopted is to narrow their samples to only geocoded tweets. Depending on a user’s privacy settings and the geolocation method used, these tweets have either an exact location specified as a pair of latitude and longitude coordinates or an approximate location specified as a rectangular bounding box. This type of geographic information (referred to in this paper as ‘device location’) represents the location of the machine or device that a user used to send a message on Twitter. More precisely, the data are derived either from the user’s device itself (using the Global Positioning System [GPS]) or by detecting the location of the user’s Internet Protocol (IP) address. Precise coordinates are almost certainly from devices with built-in GPS receivers (e.g. phones and tablets). Bounding boxes, however, can result from privacy settings applied to GPS data or from GeoIP data. Irrespective of these limitations, device locations are challenging for users to manually manipulate, and, because they are structured data, are easily interpreted by computers. However, only a small portion of users publish geocoded tweets, and it is unlikely that they form a representative sample of the broader universe of content (i.e. the division between geocoding and non-geocoding users is almost certainly biased by factors such as socioeconomic status, location, education, etc.). From a sample of 19.6 million tweets collected by the authors (these data were collected using Twitter’s ‘statuses/sample’ stream collection method with ‘spritzer’ access) over nineteen days in June 2011, only 0.7 percent of tweets contained structured geolocation information. As such, the extremely low proportion of information with attached device locations means that researchers either have to work with data that are likely highly skewed or devise effective methods to work with the profile location attached to all of the tweets that do not contain explicitly geocoded device location information.

This paper deals with these gaps of knowledge related to language and location in two primary ways. First, it explores the accuracy of a range of language detection methods on tweets, which, by definition, are short and often contain informal phrasings and abbreviations. It identifies common sources of errors and compares performance over four research locations, each comprising a large variety of languages. Second, it compares the various location information within tweets (profile location, device location, timezone information) and the accuracy with which geolocation algorithms can interpret the free-form profile location information.
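The distinction between device location and profile location maps directly onto the tweet payload. The sketch below is a hypothetical illustration rather than the authors’ code: field names follow Twitter’s classic v1.1 JSON, and the example payload is invented.

```python
# Pull the two kinds of location signal out of a tweet payload (v1.1 JSON).
def device_location(tweet: dict):
    """Return (lat, lon) from structured geodata, if any."""
    geo = tweet.get("coordinates")          # exact GPS point, GeoJSON order: [lon, lat]
    if geo and geo.get("type") == "Point":
        lon, lat = geo["coordinates"]
        return lat, lon
    place = tweet.get("place")              # approximate: rectangular bounding box
    if place and place.get("bounding_box"):
        ring = place["bounding_box"]["coordinates"][0]
        lats = [pt[1] for pt in ring]
        lons = [pt[0] for pt in ring]
        return sum(lats) / len(lats), sum(lons) / len(lons)  # box centroid
    return None

def profile_location(tweet: dict):
    """Return the free-form profile location string (may be anything, or empty)."""
    return (tweet.get("user", {}).get("location") or "").strip() or None

tweet = {  # hypothetical payload
    "coordinates": None,
    "place": {"bounding_box": {"coordinates": [[[-77.12, 38.80], [-77.12, 38.99],
                                                [-76.91, 38.99], [-76.91, 38.80]]]}},
    "user": {"location": "Washington, DC"},
}
print(device_location(tweet))   # -> centroid of the bounding box
print(profile_location(tweet))  # -> "Washington, DC", still needs geocoding
```

The asymmetry is visible even in this toy payload: the device location is machine-readable as-is, while the profile location is just a string that some geocoder must interpret.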
In performing this work, we are able to refine methods that can be employed to map and measure the geolinguistic contours of people’s information trails on Twitter. Doing so will ultimately allow future work to build on this research in order to create more accurate and nuanced understandings of the clouds of digital information that overlay our planet.

A variety of methods have been employed in looking at Twitter’s geolinguistic contours. Hong et al. (2011) used two automated tools to determine the language of a tweet, LingPipe and the Google Language Application Programming Interface (API), while Semiocast (2010) used an internal proprietary tool. Carter et al. (2011) and Gottron and Lipka (2010) discuss several of the challenges with language identification on short texts, the largest being that most language detection algorithms have been developed and trained on full documents that are longer and better formulated than the short text snippets that pass through Twitter. Carter et al. (2011) focus on microblog posts and develop two approaches (priors) to enhance performance: a link-based approach that considers the language of linked-to content, and a blogger-based approach that aggregates tweets on a per-account basis to form a larger document to classify. They find that both approaches improve accuracy but still leave room for further improvement. Hale (2012a) used the Compact Language Detection (CLD) kit, part of Google Chrome, for detecting the language of blogs in conjunction with the presence of certain keywords. He found these two methods in combination to be 95 percent accurate on a sample of 965 blogs about the Haitian ...
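The blogger-based prior credited to Carter et al. (2011) can be sketched in a few lines: concatenate each account’s tweets into a single pseudo-document before classifying. The snippet below is an illustrative reconstruction under that reading, again using langdetect as a stand-in detector rather than the classifiers the cited authors used.

```python
# Sketch of a per-account aggregation prior: classify one larger
# pseudo-document per user instead of each short tweet on its own.
from collections import defaultdict
from langdetect import detect

def account_language(tweets):
    """tweets: iterable of (user_id, text) pairs. Returns {user_id: language}."""
    docs = defaultdict(list)
    for user_id, text in tweets:
        docs[user_id].append(text)
    # One detection per account over the concatenated text
    return {uid: detect(" ".join(texts)) for uid, texts in docs.items()}

sample = [("u1", "brb"), ("u1", "on my way to the station"),
          ("u1", "that film was brilliant"), ("u2", "¡qué buen día hace hoy!")]
print(account_language(sample))  # the bare "brb" now rides on u1's other tweets
```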

Similar publications

Conference Paper
How does political discourse spread in digital networks? Can we empirically test if certain conceptual frames of social movements have a correlate on their online discussion networks? Through an analysis of the Twitter data from the Occupy movement, this paper describes the formation of political discourse over time. Building on a previous set of c...
Article
Micro-blogging through Twitter has made information short and to the point, and more importantly systematically searchable. This work is the first of a series in which quotidian observations about Tunisia are obtained using the micro-blogging site Twitter. Data was extracted using the open source Twitter API v1.1. Specific tweets were obtained usin...
Conference Paper
People use microblogging platforms like Twitter to engage with other users across a wide range of interests and practices. Twitter profiles are run by different types of users, such as humans, bots, spammers, businesses and professionals. This research work identifies six broad classes of Twitter users, and employs a supervised machine learning approach w...
Article
We present a new algorithm for inferring the home location of Twitter users at different granularities, including city, state, time zone or geographic region, using the content of users’ tweets and their tweeting behavior. Unlike existing approaches, our algorithm uses an ensemble of statistical and heuristic classifiers to predict locations and mak...
Article
Information garnered from activity on location-based social networks can be harnessed to characterize urban spaces and organize them into neighborhoods. In this work, we adopt a data-driven approach to the identification and modeling of urban neighborhoods using location-based social networks. We represent geographic points in the city using spatio...

Citations

... Third, it enables researchers, practitioners, and policymakers to collect information regarding the target audience's response and attitude toward a particular issue or subject, such as products and political or election campaigns, as well as many other day-to-day issues or topics. Several researchers have applied sentiment analysis to various fields of study, including medical and health studies [24], web and digital science [23], political studies [54], financial analysis [55], marketing and advertising [56], and HRM studies [6]. ...
Article
The COVID-19 pandemic has forced organisations to evaluate whether work from home (WFH) best fits future office management and employee productivity. The increasing popularity of web-based social media increases the possibility of using employees’ sentiment and opinion-mining techniques to track and monitor their preferences for WFH through Twitter. While social media platforms provide useful data-mining information about employee opinions, more research must be conducted to investigate the Twitter sentiment of WFH employees. This paper meets this research demand by analysing a random sample of 755,882,104 tweets linked to employees’ opinions and beliefs regarding WFH. Moreover, an analysis of Google Trends revealed a positive sentiment toward WFH. The results of this paper explore whether people (as employees) are enthusiastic and optimistic about WFH. This paper suggests that WFH has positive and supportive potential as an HRM strategy to increase workplace effectiveness for greater staff engagement and organisational sustainability.
... Social media usage coupled with Global Navigation Satellite Systems (GNSS)-enabled portable devices has become an indispensable part of daily life worldwide, which turns every user into a sensor capturing a direct snapshot of human activities at different places. Therefore, various groups like governments, non-governmental organizations (NGOs), companies, and researchers have started using such user-generated data to access the information flow among societies and explore its applications in multiple fields, e.g., crisis management, disaster relief, political sway, and religious and economic trends (Graham et al., 2014; Kryvasheyeu et al., 2016). ...
... Zhai et al. (2020) ... However, these studies were limited by a lack of data. Less than 1% of data collected from social media are geotagged, as many users turn off location-sharing due to privacy concerns (Graham et al., 2014). Meanwhile, social media companies have implemented stricter policies for accessing and sharing users’ location data. ...
Article
The rapid development of information and communications technology has turned individuals into sensors, fostering the growth of human-generated geospatial big data. In disaster management, geospatial big data, mainly social media data, have opened new avenues for observing human responses to disasters in near real-time. Previous research relies on geographical information in geotags, content, and user profiles to locate social media messages. However, less than 1% of users geotag their messages, making it increasingly crucial to geolocate users through user profiles or addresses mentioned in message content. This paper evaluates and visualizes the margin of error incurred when using user profiles or message-mentioned addresses to geolocate social media data for disaster research. Using Twitter data during the 2017 Hurricane Harvey as an example, this research assessed the inconsistencies in predicting users’ locations in various administrative units during each disaster phase using three geolocating strategies. The results reveal that the similarities between geotags and user profile locations decrease from 94.07% to 64.56%, 43.9%, 31.82%, 27.05%, and 26.7% as the geographical scale changes from country to state, county, block group, 1-kilometer, and 30-meter levels. These similarities are overall higher than the agreements between locations derived from geotags and tweet content. The geolocation consistencies among the three methods remain stable across disaster phases. The impacts of uncertainties in geolocating Twitter data for disaster management applications were further unraveled. The findings offer valuable insights into the trade-off between spatial scale and geolocation accuracy and inform the selection of appropriate scales when applying different geolocating strategies in future social media-based investigations.
... Even as the number of research studies using digital data rapidly grows, relatively few have transparently outlined their data collection and analysis methods. Gradually, researchers have begun to critically examine the assumptions behind social media data findings, reproducibility, generalizability, and representativeness and call for higher transparency in documenting methods for such studies (Assenmacher et al., 2022; boyd & Crawford, 2012; Bruns, 2013; Center for Democracy & Technology, n.d.; Cockburn et al., 2020; Council for Big Data, Ethics, and Society, n.d.; Fairness, Accountability, and Transparency in Machine Learning, n.d.; Fineberg et al., 2020; González-Bailón et al., 2014; Goroff, 2015; Graham et al., 2013; Jurgens et al., 2015; Reed & boyd, 2016; Tufekci, 2014). ...
Article
Social media dominate today’s information ecosystem and provide valuable information for social research. Market researchers, social scientists, policymakers, government entities, public health researchers, and practitioners recognize the potential for social data to inspire innovation, support products and services, characterize public opinion, and guide decisions. The appeal of mining these rich datasets is clear. However, there is potential risk of data misuse, underscoring an equally huge and fundamental flaw in the research: there are no procedural standards and little transparency. Transparency across the processes of collecting and analyzing social media data is often limited due to proprietary algorithms. Spurious findings and biases introduced by artificial intelligence (AI) demonstrate the challenges this lack of transparency poses for research. Social media research remains a virtual “wild west,” with no clear standards for reporting regarding data retrieval, preprocessing steps, analytic methods, or interpretation. Use of emerging generative AI technologies to augment social media analytics can undermine validity and replicability of findings, potentially turning this research into a “black box” enterprise. Clear guidance for social media analyses and reporting is needed to assure the quality of the resulting research. In this article, we propose criteria for evaluating the quality of studies using social media data, grounded in established scientific practice. We offer clear documentation guidelines to ensure that social data are used properly and transparently in research and applications. A checklist of disclosure elements to meet minimal reporting standards is proposed. These criteria will make it possible for scholars and practitioners to assess the quality, credibility, and comparability of research findings using digital data.
... A global mapping of Twitter usage discussing the geography of Twitter and the distribution of geolocated tweets is presented in [33]. Graham et al. (2014) [34] analyzed the geolocation and language identification of tweets, showing the spatial and linguistic distribution of Twitterers worldwide. ...
... Our assumption that the geographical origin of a user is a strong predictor of the dialect they use in their writing is a crucial part of our approach to building the corpus [34,36,37]. The relationship between language and geography has been a topic of interest to linguists since the nineteenth century [36]. ...
Article
In this study, we present the acquisition and categorization of a geographically-informed, multi-dialectal Albanian National Corpus, derived from Twitter data. The primary dialects from three distinct regions—Albania, Kosovo, and North Macedonia—are considered. The assembled publicly available dataset encompasses anonymized user information, user-generated tweets, auxiliary tweet-related data, and annotations corresponding to dialect categories. Utilizing a highly automated scraping approach, we initially identified over 1,000 Twitter users with discernible locations who actively employ at least one of the targeted Albanian dialects. Subsequent data extraction phases yielded an augmentation of the preliminary dataset with an additional 1,500 Twitterers. The study also explores the application of advanced geotagging techniques to expedite corpus generation. Alongside experimentation with diverse classification methodologies, comprehensive feature engineering and feature selection investigations were conducted. A subjective assessment is conducted using human annotators, which demonstrates that humans achieve significantly lower accuracy rates in comparison to machine learning (ML) models. Our findings indicate that machine learning algorithms are proficient in accurately differentiating various Albanian dialects, even when analyzing individual tweets. A meticulous evaluation of the most salient attributes of top-performing algorithms provides insights into the decision-making mechanisms utilized by these models. Remarkably, our investigation revealed numerous dialectal patterns that, despite being familiar to human annotators, have not been widely acknowledged within the broader scientific community.
... User profile locations are suitable for large-scale, community-based (e.g. city or county level) analysis of user behaviors, as around 55% of social media data can be correctly associated with a city based on addresses in user profiles (Graham et al., 2014). Alternatively, addresses mentioned in social media content can be used. ...
Article
Quantitative assessment of community resilience can provide support for hazard mitigation, disaster risk reduction, disaster relief, and long-term sustainable development. Traditional resilience assessment tools are mostly theory-driven and lack empirical validation, which impedes scientific understanding of community resilience and practical decision-making for resilience improvement. With the advent of the Big Data era, increasing data availability and advances in computing and modeling techniques offer new opportunities to understand, measure, and promote community resilience. This article provides a comprehensive review of the definitions of community resilience, along with the traditional and emerging data and methods of quantitative resilience measurement. The theoretical bases, modeling principles, advantages, and disadvantages of the methods are discussed. Finally, we point out research avenues to overcome the existing challenges and develop robust methods to measure and promote community resilience. This article establishes guidance for scientists to further advance disaster research and for planners and policymakers to design actionable tools to develop sustainable and resilient communities.
Preprint
The emergence and rapid development of information and communications technology (ICT) have turned individuals into sensors, fostering the growth of human-generated geospatial big data. These geospatial big data, sometimes referred to as social sensing data, have been coupled with traditional spatial data and applied in various domains to understand society and the environment at multiple spatial and temporal scales. In disaster management, social sensing data, mainly social media data, have opened new avenues for observing human responses and behaviors under disasters in near real-time. Previous research relies on geographical information in geotags, content, and user profiles to locate social media messages. However, less than 1% of users share their locations through geotagging, leaving geolocating users through addresses in user profiles or the message content increasingly crucial in future location-based social media analysis and applications. This paper attempts to evaluate and visualize the margin of error incurred when using user profile locations or message-mentioned addresses to geolocate social media data for disaster research. Using Twitter data during the 2017 Hurricane Harvey as a case study, this research assessed the inconsistencies in predicting users' locations in various administrative units during each disaster phase using three geolocating strategies. The study provides insights into the relationship between spatial scales and conflated geolocating strategies, with the 'user from' method achieving the highest agreement at the country level (94.07%) compared to the 'tweet about' method. Interestingly, the findings indicate that geolocation accuracy remains relatively stable across the three disaster recovery phases. Moreover, during the preparedness phase, the agreement percentages between the 'tweet from' and 'user from' locations reach their peak, ranging from 95.1% at the country level to 40.3% at the county level. The paper rigorously quantifies the uncertainties associated with conflating geolocating methods for Twitter data for disaster management applications, underscoring the importance of accepting an appropriate level of uncertainty or using the three geolocating methods separately in future social media-based investigations. Furthermore, the study quantifies the trade-off between spatial scale and geolocation accuracy, revealing a decline in agreement between two geolocating methods as the geographical scale transitions from state to county, block group, 1-kilometer, and 30-meter levels. The potential impacts of uncertainties in geolocating Twitter data for disaster management applications were further unraveled. The findings offer valuable insights into selecting appropriate scales when applying different geolocating strategies in future social media-based investigations.
... For instance, geotagging text data on the web has been approached through geometrical methods [25], offering an alternative to relying solely on explicit geotags. The potential of Twitter data was highlighted both in determining the geographic origin of user-generated content [11] and in developing predictive models that can estimate the location of Twitter users based on their posted content [36]. Moreover, mining Twitter data offered a better understanding of disaster resilience [37], while also proving effective in event classification and location prediction during disasters [30]. ...
... Thus, we proceed to create candidate toponyms using baseline NER models. Pre-annotations are provided by two off-the-shelf NER models: spaCy’s en_core_web_md, a pretrained pipeline for English that includes NER components, and xlm-roberta-base-wikiann-ner, a multilingual RoBERTa-based NER model fine-tuned on 20 annotated Wikipedia datasets. The pre-annotations are the union of the predictions of the two models. ...
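The merging step described here, taking the union of two models’ predicted entity spans, is simple to sketch. In the snippet below the model outputs are stubbed as (start, end, label) tuples, since the exact checkpoints and their outputs belong to the cited work; only the union logic is shown.

```python
# Union of pre-annotations from two NER models over the same text.
def union_spans(pred_a, pred_b):
    """Each prediction is a (start, end, label) tuple over the same text.
    Keep every span either model proposes, dropping exact duplicates."""
    merged = set(pred_a) | set(pred_b)
    return sorted(merged)  # sort by start offset for readability

spacy_preds = [(0, 6, "LOC")]                      # e.g. spaCy finds "London"
roberta_preds = [(0, 6, "LOC"), (25, 31, "LOC")]   # XLM-R also finds "Dublin"
print(union_spans(spacy_preds, roberta_preds))
# -> [(0, 6, 'LOC'), (25, 31, 'LOC')]
```

Taking the union rather than the intersection favors recall, which suits a pre-annotation stage where human annotators later prune false positives.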
... Twitter will not show any location information unless the user has opted in and has allowed his/her device or browser to transmit the coordinates to it. As a matter of fact, only a small portion of tweets are geotagged, less than 1% (Hale et al., 2012). However, since many Twitter studies need to know where a tweet came from in order to investigate regional user behaviour, many approaches have been proposed for the geolocation task. ...
... This approach applies natural language processing methods on the text of a tweet to predict user location by leveraging words indicative of locality, for example, by being more commonly used in certain regions. Due to the unstructured nature of the data and the general complexity of the problem, geolocation methods using tweet content employ a wide range of techniques, ranging from maximum likelihood approaches to machine learning/deep learning models, both supervised and unsupervised (Cheng et al., 2010; Chandra et al., 2011; Wing and Baldridge, 2011; Roller et al., 2012; Han et al., 2013, 2014; Graham et al., 2014; Onan, 2017; Hoang and Mothe, 2018). Obviously, geolocation methods can also combine tweet content, including photos (Matsuo et al., 2017), with network data and metadata to achieve better results (Ren et al., 2012; Elmongui et al., 2015; Miura et al., 2017; Bakerman et al., 2018; Ribeiro and Pappa, 2018; Tian et al., 2020). ...
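As a toy illustration of the maximum-likelihood end of this spectrum (not a reconstruction of any one cited system), a tweet can be scored against per-region word distributions and assigned to the argmax region. The regions, words, and probabilities below are invented.

```python
# Toy maximum-likelihood content-based geolocation: pick the region whose
# word distribution makes the tweet most likely.
import math

# Hypothetical per-region word probabilities, as if estimated from training tweets
word_probs = {
    "Boston":  {"wicked": 0.02, "pahk": 0.005, "taxi": 0.01},
    "Houston": {"y'all": 0.03, "rodeo": 0.01, "taxi": 0.01},
}
FLOOR = 1e-6  # smoothing probability for words unseen in a region

def predict_region(tweet_text):
    tokens = tweet_text.lower().split()
    scores = {}
    for region, probs in word_probs.items():
        # Sum of log-probabilities, i.e. a naive unigram likelihood
        scores[region] = sum(math.log(probs.get(t, FLOOR)) for t in tokens)
    return max(scores, key=scores.get)

print(predict_region("wicked cold out, need a taxi"))  # -> "Boston"
```

Real systems add smoothing schemes, spatial priors, and far larger vocabularies, but the scoring skeleton is the same.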
Book
This is the full ebook of our Research Topic on "Big Data and Machine Learning in Sociology", published in "Frontiers in Sociology". It contains 10 articles from several authors and our editorial. Further information on the Research Topic and the articles is available here: https://www.frontiersin.org/research-topics/23160/big-data-and-machine-learning-in-sociology
... Language identification: Twitter posts present a challenge to automated language identification (LangID) due to their short length, informal style, and lack of ground truth labels (Graham, Hale, and Gaffney 2014; Williams and Dagli 2017). We use Twitter's built-in language detector because it is computationally efficient for a massive dataset, requires few additional resources, and is trained on in-domain data. We validate our decision with a comparison to 5 popular LangID packages: langdetect, langid.py
... (Lui and Baldwin 2012), and CLD2 use probabilistic models, while fastText (Joulin et al. 2016) and CLD3 use neural networks. As in prior LangID evaluations, we randomly sample 1K tweets from 32 countries written in that country's dominant language, as labeled by Twitter (Graham, Hale, and Gaffney 2014; Lamprinidis et al. 2021). Like Graham, Hale, and Gaffney (2014), we calculate intercoder agreement between all pairs of models. ...
... As in prior LangID evaluations, we randomly sample 1K tweets from 32 countries written in that country's dominant language, as labeled by Twitter (Graham, Hale, and Gaffney 2014; Lamprinidis et al. 2021). Like Graham, Hale, and Gaffney (2014), we calculate intercoder agreement between all pairs of models. Table S3 (Supplemental Material) shows that Twitter's LangID has high agreement with the other models, even at higher rates than they agree with each other. ...
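The pairwise agreement computation described in this excerpt reduces to counting label matches per model pair. A minimal sketch, with invented model outputs standing in for the real predictions:

```python
# Pairwise intercoder agreement between LangID models: for each pair,
# the share of tweets on which their predicted labels match.
from itertools import combinations

predictions = {  # model name -> list of language codes, one per tweet (invented)
    "twitter":    ["en", "en", "fr", "es"],
    "langdetect": ["en", "de", "fr", "es"],
    "cld3":       ["en", "en", "fr", "pt"],
}

for a, b in combinations(predictions, 2):
    pa, pb = predictions[a], predictions[b]
    agreement = sum(x == y for x, y in zip(pa, pb)) / len(pa)
    print(f"{a} vs {b}: {agreement:.0%}")
```

Raw percent agreement is the simplest such measure; chance-corrected statistics such as Cohen’s kappa follow the same pairwise structure.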
Article
Social media enables the rapid spread of many kinds of information, from pop culture memes to social movements. However, little is known about how information crosses linguistic boundaries. We apply causal inference techniques on the European Twitter network to quantify the structural role and communication influence of multilingual users in cross-lingual information exchange. Overall, multilinguals play an essential role; posting in multiple languages increases betweenness centrality by 13%, and having a multilingual network neighbor increases monolinguals' odds of sharing domains and hashtags from another language 16-fold and 4-fold, respectively. We further show that multilinguals have a greater impact on diffusing information that is less accessible to their monolingual compatriots, such as information from far-away countries and content about regional politics, nascent social movements, and job opportunities. By highlighting information exchange across borders, this work sheds light on a crucial component of how information and ideas spread around the world.