Bo Han's research while affiliated with University of Melbourne and other places

This page lists the scientific contributions of an author, who either does not have a ResearchGate profile, or has not yet added these contributions to their profile.

It was automatically created by ResearchGate to create a record of this author's body of work. We create such pages to advance our goal of creating and maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.

If you're a ResearchGate member, you can follow this page to keep up with this author's work.

If you are this author, and you don't want us to display this page anymore, please let us know.

Text-Based Twitter User Geolocation Prediction

Article

March 2014

·

678 Reads

·

307 Citations

Journal of Artificial Intelligence Research

·

·

Timothy Baldwin

Geographical location is vital to geospatial applications like local search and event detection. In this paper, we investigate and improve on the task of text-based geolocation prediction of Twitter users. Previous studies on this topic have typically assumed that geographical references (e.g., gazetteer terms, dialectal words) in a text are indicative of its author’s location. However, these references are often buried in informal, ungrammatical, and multilingual data, and are therefore non-trivial to identify and exploit. We present an integrated geolocation prediction framework and investigate what factors impact on prediction accuracy. First, we evaluate a range of feature selection methods to obtain “location indicative words”. We then evaluate the impact of non-geotagged tweets, language, and user-declared metadata on geolocation prediction. In addition, we evaluate the impact of temporal variance on model generalisation, and discuss how users differ in terms of their geolocatability. We achieve state-of-the-art results for the text-based Twitter user geolocation task, and also provide the most extensive exploration of the task to date. Our findings provide valuable insights into the design of robust, practical text-based geolocation prediction systems.

Statistical Methods for Identifying Local Dialectal Terms from GPS-Tagged Documents

Article

January 2014

·

29 Reads

·

9 Citations

Dictionaries Journal of the Dictionary Society of North America

·

·

Timothy Baldwin

Corpora of documents whose metadata includes GPS coordinates have recently become widely available through online social media such as Twitter. This has created opportunities for statistical corpus methods that describe the geographical spread of words, but such techniques do not appear to be widely used in corpus linguistics and lexicography. This paper presents several methods for describing the spread of a set of points, corresponding to documents containing a given word and applies the methods to a corpus of GPS-tagged tweets from Twitter. In experiments on known regionalisms, we show that these methods could be used to help identify such expressions. We analyze the words in the corpus identified as having the most geographically restricted usage and identify some expressions that appear to be previously undocumented regionalisms with highly localized usage.

A Stacking-based Approach to Twitter User Geolocation Prediction

Conference Paper

August 2013

·

82 Reads

·

69 Citations

·

·

Timothy Baldwin

Lexical Normalization for Social Media Text

Article

February 2013

·

597 Reads

·

185 Citations

ACM Transactions on Intelligent Systems and Technology

·

·

Timothy Baldwin

Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this article, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalizing lexical variants. Our method uses a classifier to detect lexical variants, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn't require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.

Geolocation Prediction in Social Media Data by Finding Location Indicative Words

Conference Paper

December 2012

·

956 Reads

·

173 Citations

·

·

Timothy Baldwin

Geolocation prediction is vital to geospatial applications like localised search and local event detection. Predominately, social media geolocation models are based on full text data, including common words with no geospatial dimension (e.g. today) and noisy strings (tmrw), potentially hampering prediction and leading to slower/more memory-intensive models. In this paper, we focus on finding location indicative words (LIWs) via feature selection, and establishing whether the reduced feature set boosts geolocation accuracy. Our results show that an information gain ratiobased approach surpasses other methods at LIW selection, outperforming state-of-the-art geolocation prediction methods by 10.6% in accuracy and reducing the mean and median of prediction error distance by 45km and 209km, respectively, on a public dataset. We further formulate notions of prediction confidence, and demonstrate that performance is even higher in cases where our model is more confident, striking a trade-off between accuracy and coverage. Finally, the identified LIWs reveal regional language differences, which could be potentially useful for lexicographers.

Automatically constructing a normalisation dictionary for microblogs

Conference Paper

July 2012

·

135 Reads

·

150 Citations

·

·

Timothy Baldwin

Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highly-ranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset. Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing.

A support platform for event detection using social intelligence

Conference Paper

April 2012

·

35 Reads

·

16 Citations

Timothy Baldwin

·

·

·

[...]

·

Masud Moshtaghi

This paper describes a system designed to support event detection over Twitter. The system operates by querying the data stream with a user-specified set of keywords, filtering out non-English messages, and probabilistically geolocating each message. The user can dynamically set a probability threshold over the geolocation predictions, and also the time interval to present data for.

Melbourne Language Technology Group Microblog Track Report

Article

January 2011

·

25 Reads

·

·

Timothy Baldwin

Lexical Normalisation of Short Text Messages: Makn Sens a #twitter.

January 2011

·

1,513 Reads

·

449 Citations

·

Timothy Baldwin

Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn't require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.

... While there has been a wealth of work that has used Twitter data to explore lexical variation (e.g., Eisenstein et al. (2012Eisenstein et al. ( , 2014; Cook, Han, and Baldwin (2014); Doyle (2014); Jones (2015); Huang et al. (2016); Kulkarni, Perozzi, and Skiena (2016); Grieve, Nini, and Guo (2018)), the incorporation of distributional methods is a more recent trend. Huang et al. (2016) apply a count-based method to Twitter data to represent language use in counties across the United States. ...
Reference:
Capturing Fine-Grained Regional Differences in Language Use through Voting Precinct Embeddings

Statistical Methods for Identifying Local Dialectal Terms from GPS-Tagged Documents

Citing Article
January 2014

Dictionaries Journal of the Dictionary Society of North America

·

·

Timothy Baldwin

... Previous studies show possibilities of automatically inferring locations from location mentions in microblog text data. For example, [65] used classical machine tools to predict the geolocation of social media text data using location-indicative words through feature selection engineering. ...
Reference:
Augmenting web-based tourist support system with microblog analyzed data

Geolocation Prediction in Social Media Data by Finding Location Indicative Words

Citing Conference Paper
December 2012

·

·

Timothy Baldwin

... City-level geolocation prediction systems have been investigated as well. Han et al. [13] implemented a stacking approach combining tweet text and user-declared metadata, achieving higher accuracy compared to benchmark methods. Their study highlighted the impact of temporal factors on model generalization and discussed potential applications of the system. ...
Reference:
LocBERT: Improving Social Media User Location Prediction Using Fine-Tuned BERT

A Stacking-based Approach to Twitter User Geolocation Prediction

Citing Conference Paper
August 2013

·

·

Timothy Baldwin

... Their proposed combined dimension showed some improvement in prediction accuracy. [38] presented the user geolocation prediction framework using a generative Naïve Bayes model which estimates the joint probability of an observed word vector and a class. The authors preferred such an algorithm among others because of its simplicity of being easily retrained. ...
Reference:
Augmenting web-based tourist support system with microblog analyzed data

Text-Based Twitter User Geolocation Prediction

Citing Article
March 2014

Journal of Artificial Intelligence Research

·

·

Timothy Baldwin

... For English slang terms, we use the NoSlang dictionary. 6 For English, we use the dictionary developed byHan et al. (2012) consist of 41,181 tokens for micro-blog normalization task. Limited data of approx. ...
Reference:
Code‐mixed Hindi‐English text correction using fuzzy graph and word embedding

Automatically constructing a normalisation dictionary for microblogs

Citing Conference Paper
July 2012

·

·

Timothy Baldwin

... We consider that the limit of this method is that it fails to find localized events when no places are mentioned in the correspondent tweets. [27] proposes also an interactive system for event detection which operates on English messages. Precisely, users obtain tweets which are related to their queries at multiple granularities of time and space. ...
Reference:
Real-Time Event Localization and Detection Over Social Networks Using Apache Intelligence

A support platform for event detection using social intelligence

Citing Conference Paper
April 2012

Timothy Baldwin

·

·

·

[...]

·

Masud Moshtaghi

... Similarly, the authors of [2] suggested a technique that relies on a phonetic algorithm to make lexical variances in the Roman Urdu text more typical. The researchers from [2], [3], [4], [5], and [6] carried out a comparative analysis to evaluate how lexical variance is handled by normalization approaches in the text of multilingual social media postings, Including Roman Hindi, Dutch, Finnish, Arabic, Spanish, Bangla, Japanese, Chinese, and Polish. Among the several methods of normalizing is the Rule-based approach stemming and lemmatization, phonetic algorithms, machine learning algorithms, etc. ...
Reference:
Hate Speech Detection Using Deep Learning

Lexical Normalization for Social Media Text

Citing Article
February 2013

ACM Transactions on Intelligent Systems and Technology

·

·

Timothy Baldwin

... Out-of-vocabulary (OOV) words are prevalent in social media text, and they pose significant challenges [14]. Furthermore, the evolving nature of online language necessitates periodic model updates [6]. ...
Reference:
Enhancing User Experience in Chinese Initial Text Conversations with Personalised AI-Powered Assistant

Lexical Normalisation of Short Text Messages: Makn Sens a #twitter.

Citing Conference Paper
Full-text available
January 2011

·

Timothy Baldwin