Thang Hoang Ta
University of Dalat
Doctor of Philosophy
About
19
Publications
3,189
Reads
41
Citations
Introduction
I received my Ph.D. from the Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN). My research interests include Natural Language Generation, Knowledge Bases, and Sentiment Analysis. I am also a web designer with more than 10 years of experience.
Additional affiliations
April 2008 - present
Education
October 2012 - February 2015
September 2003 - January 2008
Publications
Publications (19)
In this paper, we introduce BSRBF-KAN, a Kolmogorov-Arnold Network (KAN) that combines B-splines and radial basis functions (RBFs) to fit input vectors in data training. We perform experiments with BSRBF-KAN, MLP, and other popular KANs, including EfficientKAN, FastKAN, FasterKAN, and GottliebKAN over the MNIST and FashionMNIST datasets. BSRBF-KAN s...
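The core idea — summing a B-spline basis and an RBF basis over the same input — can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the real BSRBF-KAN uses trainable weights and higher-order B-splines, while this sketch uses fixed Gaussian RBFs and a degree-1 (hat) B-spline stand-in. The function names and grid choices here are illustrative assumptions.

```python
import numpy as np

def rbf_basis(x, centers, gamma=1.0):
    # Gaussian radial basis functions evaluated at each x for each center
    return np.exp(-gamma * (x[:, None] - centers[None, :]) ** 2)

def linear_bspline_basis(x, knots):
    # Degree-1 (hat) B-spline basis on a uniform knot grid — a simple
    # stand-in for the higher-order B-splines used in KAN implementations
    h = knots[1] - knots[0]
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots[None, :]) / h)

def bsrbf_features(x, centers, knots):
    # Illustrative combination: sum the two bases per input, mirroring
    # the idea of fusing B-spline and RBF components
    return rbf_basis(x, centers) + linear_bspline_basis(x, knots)

x = np.linspace(0.0, 1.0, 5)
grid = np.linspace(0.0, 1.0, 4)
feats = bsrbf_features(x, grid, grid)
print(feats.shape)  # (5, 4)
```

At an input that coincides with a grid point, both bases peak (RBF = 1, hat = 1), so the combined feature there is 2 — a quick sanity check that both components contribute.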
Emotions are integral to human social interactions, with diverse responses elicited by various situational contexts. Particularly, the prevalence of negative emotional states has been correlated with negative outcomes for mental health, necessitating a comprehensive analysis of their occurrence and impact on individuals. In this paper, we introduce...
This paper introduces a novel training model, self-training from self-memory (STSM) in data-to-text generation (DTG), allowing the model to self-train on subsets, including self-memory as outputs inferred directly from the trained models and/or the new data. The quality of self-memory is validated by two models, data-to-text (D2T) and text-to-data...
The use of transfer learning methods is largely responsible for the present breakthrough in Natural Language Processing (NLP) tasks across multiple domains. In order to solve the problem of sentiment detection, we examined the performance of four different types of well-known state-of-the-art transformer models for text classification. Models such...
Acknowledged as one of the most successful online cooperative projects in human society, Wikipedia has grown rapidly in recent years and continuously seeks to expand its content and disseminate knowledge to everyone globally. A shortage of volunteers brings Wikipedia many issues, including developing content for over 300 language...
As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to many Natural Language Processing (NLP) tasks, such as information retrieval, knowledge base building, machine translation, text classification, and text summarization. In this paper, we introduce WikiDes, a novel dataset to generate short descriptions of...
In this paper, we participate in the task of Detection of Aggressive and Violent INCIdents from Social Media in Spanish (DA-VINCIS). We apply a multi-task learning network, MT-DNN, to train on users' tweets using text embeddings from pre-trained transformer models. In the first subtask, we obtained the best F1 of 74.80%, Precision of 75.52%, and Rec...
In this paper, we address Subtask 1 of Detection of Aggressive and Violent INCIdents from Social Media in Spanish (DA-VINCIS). Our method uses text embeddings from pre-trained transformer models for training with GAN-BERT, an adversarial learning architecture. Finally, we obtained F1 of 74.43%, Precision of 74.08%, and Rec...
In this paper, we address Task 1 and Task 2 of EXIST 2022, detecting sexism in a broad sense, from ideological inequality, sexual violence, and misogyny to other expressions involving implicit sexist behaviours in social networks. We apply transfer learning from a pre-trained multilingual DeBERTa (mDeBERTa) model and its zero classificatio...
In this paper, we work on Paraphrase Identification in Mexican Spanish (PAR-MEX) at the sentence level. We introduced two lightweight methods, linear regression and a multilayer perceptron, for training on features extracted from pre-trained models. As a rule of thumb, pair similarity is used to filter noise from the positive examples. We obtained t...
In this paper, we address the task of Paraphrase Identification in Mexican Spanish (PAR-MEX) at the sentence level. Our method uses text embeddings from pre-trained transformer models for training with GAN-BERT, an adversarial learning architecture. We modified the noise for the generator, which has a random rate and the same size as the hid...
This paper presents our participation in the task of detecting gender, profession, and political ideology in tweets of Spanish users, from both binary and multi-class perspectives. The task plays an important role in identifying the political ideology of parties and politicians, especially newly emerging ones. This may support relevant tasks to make prediction...
Hatred spreading through language on social media platforms and in online groups is becoming a well-known phenomenon. Comparing two text representations, bag of words (BoW) and pre-trained GloVe word embeddings, we used a binary classification approach to automatically process user content and detect hate speech. The Naive Bayes...
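The BoW branch of such a pipeline is straightforward to sketch with scikit-learn. This is a minimal illustration, not the paper's setup: the tiny corpus and labels below are invented for demonstration, and the real work would use a large annotated hate-speech dataset and proper evaluation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented dataset standing in for an annotated hate-speech corpus
texts = ["I hate you", "you are awful", "have a nice day", "great work friend"]
labels = [1, 1, 0, 0]  # 1 = hateful, 0 = not hateful

vec = CountVectorizer()           # bag-of-words representation
X = vec.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)  # Naive Bayes binary classifier
pred = clf.predict(vec.transform(["I hate awful people"]))
print(pred[0])  # 1
```

Swapping `CountVectorizer` for averaged GloVe vectors (with a Gaussian rather than multinomial Naive Bayes) gives the embedding-based comparison point.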
In this paper, we engage in Task 2 of the SMART Task 2021 challenge, predicting relations used to identify the correct answer to a given question. This is a subtask of Knowledge Base Question Answering (KBQA) and offers valuable insights for the development of KBQA systems. We introduce our method, combining BERT and data oversampling with text...
In this paper, we extract quotations from Al Jazeera’s news articles containing keywords related to the COVID-19 pandemic. We apply Latent Dirichlet allocation (LDA), coherence measures, and clustering algorithms to explore, in an unsupervised way, latent topics from a dataset of about 3,400 quotations to see how the coronavirus impacts human beings. By combini...
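The LDA step can be sketched with scikit-learn. This is an illustrative toy, not the paper's pipeline: the five mock quotations below stand in for the ~3,400 real ones, and the topic count and preprocessing are assumptions for demonstration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Mock quotations standing in for the COVID-19 quotation dataset
quotes = [
    "lockdown closed schools and businesses",
    "vaccine trials show promising results",
    "hospital beds filled with covid patients",
    "schools reopen after lockdown eases",
    "new vaccine doses arrive at hospitals",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(quotes)

# Fit LDA with an assumed 2 latent topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

# Each row is a per-document topic distribution summing to 1
print(doc_topics.shape)  # (5, 2)
```

In practice the number of topics would be chosen by the coherence measures the abstract mentions, and the resulting document-topic vectors fed to the clustering step.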
Wikipedia is well known as the largest open encyclopedia today, with the aim of spreading knowledge to everyone in the world. By applying bots to automatic article creation, the Vietnamese edition is one of 13 language projects with more than one million articles. However, this creates many challenges for Vietnamese Wikipedia in improving the qual...
Wikidata is an open online database that stores shared resources for related projects managed by the Wikimedia Foundation. Unifying Wikipedia's infoboxes is set out in phase 2 of the Wikidata plan. Accordingly, infoboxes will be unified to avoid data divergence among the language projects...
Wikipedia currently hosts a large body of converged data with millions of contributions across more than 287 languages. Its content changes rapidly and continuously, with thousands of edits every hour, which triggers many challenges for Wikipedia in controlling, associating, and balancing article content among language editions. This paper provides some process...
Questions
Questions (2)
I have found many papers researching monolingual or bilingual issues in NLP, so there does not seem to be much left to do there. Given a problem, many scholars try to extend it to the general case by approaching multilingualism. Should we conclude that multilingualism will be the future of NLP?
I am looking for any professors who research Wikipedia and related topics.