Bo Han's research while affiliated with University of Melbourne and other places

What is this page?


This page lists the scientific contributions of an author who either does not have a ResearchGate profile or has not yet added these contributions to their profile.

It was automatically generated by ResearchGate to provide a record of this author's body of work, in support of our goal of maintaining the most comprehensive scientific repository possible. In doing so, we process publicly available (personal) data relating to the author as a member of the scientific community.


Publications (9)


Text-Based Twitter User Geolocation Prediction
  • Article

March 2014 · 678 Reads · 307 Citations
Journal of Artificial Intelligence Research

Bo Han · Paul Cook

Geographical location is vital to geospatial applications like local search and event detection. In this paper, we investigate and improve on the task of text-based geolocation prediction of Twitter users. Previous studies on this topic have typically assumed that geographical references (e.g., gazetteer terms, dialectal words) in a text are indicative of its author’s location. However, these references are often buried in informal, ungrammatical, and multilingual data, and are therefore non-trivial to identify and exploit. We present an integrated geolocation prediction framework and investigate what factors impact on prediction accuracy. First, we evaluate a range of feature selection methods to obtain “location indicative words”. We then evaluate the impact of non-geotagged tweets, language, and user-declared metadata on geolocation prediction. In addition, we evaluate the impact of temporal variance on model generalisation, and discuss how users differ in terms of their geolocatability. We achieve state-of-the-art results for the text-based Twitter user geolocation task, and also provide the most extensive exploration of the task to date. Our findings provide valuable insights into the design of robust, practical text-based geolocation prediction systems.
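The citation contexts further down this page note that the framework's underlying classifier is a generative Naive Bayes model over word observations. As a hedged illustration only (toy data and hypothetical function names, not the authors' implementation), a bag-of-words Naive Bayes geolocator can be sketched as:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes geolocator.
    docs: list of (city, tokens) pairs."""
    city_counts = defaultdict(int)      # documents per city
    word_counts = defaultdict(Counter)  # word frequencies per city
    vocab = set()
    for city, tokens in docs:
        city_counts[city] += 1
        word_counts[city].update(tokens)
        vocab.update(tokens)
    total = sum(city_counts.values())
    priors = {c: math.log(n / total) for c, n in city_counts.items()}
    likelihood = {}
    for city, counts in word_counts.items():
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        likelihood[city] = {w: math.log((counts[w] + 1) / denom) for w in vocab}
    return priors, likelihood, vocab

def predict(tokens, priors, likelihood, vocab):
    """Return the city maximising the joint log-probability of the tokens."""
    def score(city):
        return priors[city] + sum(likelihood[city][w] for w in tokens if w in vocab)
    return max(priors, key=score)
```

The abstract's "location indicative words" would, in this sketch, correspond to restricting `vocab` to a selected feature subset before training.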


Statistical Methods for Identifying Local Dialectal Terms from GPS-Tagged Documents

January 2014 · 29 Reads · 9 Citations
Dictionaries: Journal of the Dictionary Society of North America

Corpora of documents whose metadata includes GPS coordinates have recently become widely available through online social media such as Twitter. This has created opportunities for statistical corpus methods that describe the geographical spread of words, but such techniques do not appear to be widely used in corpus linguistics and lexicography. This paper presents several methods for describing the spread of a set of points corresponding to documents containing a given word, and applies these methods to a corpus of GPS-tagged tweets from Twitter. In experiments on known regionalisms, we show that these methods could be used to help identify such expressions. We analyze the words in the corpus identified as having the most geographically restricted usage and identify some expressions that appear to be previously undocumented regionalisms with highly localized usage.
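One simple instance of the kind of spread statistic the abstract describes (a sketch under assumed conventions, not the paper's exact measures) is the mean great-circle distance of a word's document locations from their centroid; words with a small spread are candidates for regionalisms:

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

def spread_km(points):
    """Mean distance of points from their naive (lat, lon) centroid.
    Note: averaging raw coordinates is crude near the antimeridian,
    but suffices for a localised-vs-dispersed comparison."""
    lat = sum(p[0] for p in points) / len(points)
    lon = sum(p[1] for p in points) / len(points)
    return sum(haversine_km((lat, lon), p) for p in points) / len(points)
```

Ranking a corpus's vocabulary by `spread_km` over each word's tweet locations would then surface the most geographically restricted terms.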



Lexical Normalization for Social Media Text

February 2013 · 597 Reads · 185 Citations
ACM Transactions on Intelligent Systems and Technology

Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this article, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalizing lexical variants. Our method uses a classifier to detect lexical variants, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn't require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.
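The pipeline the abstract outlines (detect a lexical variant, generate correction candidates by similarity, select the best) can be caricatured with plain Levenshtein distance standing in for the paper's morphophonemic similarity and context modelling; the names and threshold below are illustrative assumptions, not the published method:

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalise(word, lexicon, max_dist=2):
    """Return the in-vocabulary candidate closest to `word`, or the word
    itself when no candidate is within `max_dist` edits."""
    best, best_d = word, max_dist + 1
    for cand in lexicon:
        d = edit_distance(word, cand)
        if d < best_d:
            best, best_d = cand, d
    return best
```

In the actual system, context around the token would also weigh in on candidate selection rather than string distance alone.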


Geolocation Prediction in Social Media Data by Finding Location Indicative Words

December 2012 · 956 Reads · 173 Citations

Geolocation prediction is vital to geospatial applications like localised search and local event detection. Predominantly, social media geolocation models are based on full text data, including common words with no geospatial dimension (e.g. today) and noisy strings (tmrw), potentially hampering prediction and leading to slower/more memory-intensive models. In this paper, we focus on finding location indicative words (LIWs) via feature selection, and establishing whether the reduced feature set boosts geolocation accuracy. Our results show that an information gain ratio-based approach surpasses other methods at LIW selection, outperforming state-of-the-art geolocation prediction methods by 10.6% in accuracy and reducing the mean and median of prediction error distance by 45km and 209km, respectively, on a public dataset. We further formulate notions of prediction confidence, and demonstrate that performance is even higher in cases where our model is more confident, striking a trade-off between accuracy and coverage. Finally, the identified LIWs reveal regional language differences, which could be potentially useful for lexicographers.
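For concreteness, the information gain ratio used here for LIW selection can be computed per word as follows; this is the textbook formulation over toy documents (hypothetical data and names), not the paper's code:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def info_gain_ratio(docs, word):
    """Information gain ratio of observing `word` w.r.t. city labels.
    docs: list of (city, token_set) pairs."""
    base = entropy(list(Counter(c for c, _ in docs).values()))
    with_w = [c for c, toks in docs if word in toks]
    without = [c for c, toks in docs if word not in toks]
    n = len(docs)
    cond, split_counts = 0.0, []
    for part in (with_w, without):
        if part:
            cond += len(part) / n * entropy(list(Counter(part).values()))
            split_counts.append(len(part))
    gain = base - cond                  # information gain of the word
    split_info = entropy(split_counts)  # penalises uninformative splits
    return gain / split_info if split_info else 0.0
```

A perfectly location-bound word like a city-specific term scores near 1, while a word appearing uniformly everywhere (e.g. "today") scores 0, matching the abstract's motivation for discarding such features.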


Automatically constructing a normalisation dictionary for microblogs

July 2012 · 135 Reads · 150 Citations

Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highly-ranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset. Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing.
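A minimal sketch of this dictionary-construction idea, with `difflib`'s character-overlap ratio standing in for the paper's string-similarity ranking (the generation of candidate pairs from shared contexts is assumed to have happened upstream; the threshold is an illustrative assumption):

```python
from difflib import SequenceMatcher

def build_dictionary(pairs, threshold=0.5):
    """Score candidate (variant, standard) pairs by string similarity and
    keep the best-scoring standard form per variant above the threshold."""
    best = {}
    for variant, standard in pairs:
        score = SequenceMatcher(None, variant, standard).ratio()
        if score >= threshold and score > best.get(variant, ("", 0.0))[1]:
            best[variant] = (standard, score)
    return {v: s for v, (s, _) in best.items()}

def normalise(tokens, dictionary):
    """Lexical normalisation as simple string substitution via lookup."""
    return [dictionary.get(t, t) for t in tokens]
```

Once built, the dictionary makes normalisation a constant-time lookup per token, which is the speed and simplicity advantage the abstract claims over model-based approaches.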


A support platform for event detection using social intelligence

April 2012 · 35 Reads · 16 Citations
Paul Cook · Bo Han · [...]

This paper describes a system designed to support event detection over Twitter. The system operates by querying the data stream with a user-specified set of keywords, filtering out non-English messages, and probabilistically geolocating each message. The user can dynamically set a probability threshold over the geolocation predictions, and also the time interval to present data for.



Figure 1: Out-of-vocabulary word distribution in English Gigaword (NYT), Twitter and SMS data
Figure 2: Ill-formed word detection precision, recall and F-score
Lexical Normalisation of Short Text Messages: Makn Sens a #twitter.
  • Conference Paper
  • Full-text available

January 2011 · 1,513 Reads · 449 Citations

Twitter provides access to large volumes of data in real time, but is notoriously noisy, hampering its utility for NLP. In this paper, we target out-of-vocabulary words in short text messages and propose a method for identifying and normalising ill-formed words. Our method uses a classifier to detect ill-formed words, and generates correction candidates based on morphophonemic similarity. Both word similarity and context are then exploited to select the most probable correction candidate for the word. The proposed method doesn't require any annotations, and achieves state-of-the-art performance over an SMS corpus and a novel dataset based on Twitter.


Citations (8)


... While there has been a wealth of work that has used Twitter data to explore lexical variation (e.g., Eisenstein et al. (2012, 2014); Cook, Han, and Baldwin (2014); Doyle (2014); Jones (2015); Huang et al. (2016); Kulkarni, Perozzi, and Skiena (2016); Grieve, Nini, and Guo (2018)), the incorporation of distributional methods is a more recent trend. Huang et al. (2016) apply a count-based method to Twitter data to represent language use in counties across the United States. ...

Reference:

Capturing Fine-Grained Regional Differences in Language Use through Voting Precinct Embeddings
Statistical Methods for Identifying Local Dialectal Terms from GPS-Tagged Documents
  • Citing Article
  • January 2014

Dictionaries Journal of the Dictionary Society of North America

... Previous studies show the possibility of automatically inferring locations from location mentions in microblog text data. For example, [65] used classical machine learning methods to predict the geolocation of social media text data using location-indicative words obtained through feature selection. ...

Geolocation Prediction in Social Media Data by Finding Location Indicative Words
  • Citing Conference Paper
  • December 2012

... City-level geolocation prediction systems have been investigated as well. Han et al. [13] implemented a stacking approach combining tweet text and user-declared metadata, achieving higher accuracy compared to benchmark methods. Their study highlighted the impact of temporal factors on model generalization and discussed potential applications of the system. ...

A Stacking-based Approach to Twitter User Geolocation Prediction
  • Citing Conference Paper
  • August 2013

... Their proposed combined dimension showed some improvement in prediction accuracy. [38] presented a user geolocation prediction framework using a generative Naïve Bayes model, which estimates the joint probability of an observed word vector and a class. The authors preferred this algorithm over others because of its simplicity and the ease with which it can be retrained. ...

Text-Based Twitter User Geolocation Prediction
  • Citing Article
  • March 2014

Journal of Artificial Intelligence Research

... We consider that the limit of this method is that it fails to find localised events when no places are mentioned in the corresponding tweets. [27] also proposes an interactive system for event detection which operates on English messages. Specifically, users obtain tweets related to their queries at multiple granularities of time and space. ...

A support platform for event detection using social intelligence
  • Citing Conference Paper
  • April 2012

... Similarly, the authors of [2] suggested a technique that relies on a phonetic algorithm to normalize lexical variants in Roman Urdu text. The researchers from [2], [3], [4], [5], and [6] carried out a comparative analysis to evaluate how normalization approaches handle lexical variation in the text of multilingual social media postings, including Roman Hindi, Dutch, Finnish, Arabic, Spanish, Bangla, Japanese, Chinese, and Polish. Normalization methods include rule-based approaches such as stemming and lemmatization, phonetic algorithms, and machine learning algorithms. ...

Lexical Normalization for Social Media Text
  • Citing Article
  • February 2013

ACM Transactions on Intelligent Systems and Technology