fastHan: A BERT-based Multi-Task Toolkit for Chinese NLP

January 2021

DOI:10.18653/v1/2021.acl-demo.12

Conference: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations

Authors:

Zhichao Geng

Fudan University

Xipeng Qiu

Fudan University

Xuanjing Huang

Fudan University


Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 99–106, August 1st - August 6th, 2021.

fastHan: A BERT-based Multi-Task Toolkit for Chinese NLP

Zhichao Geng, Hang Yan, Xipeng Qiu∗, Xuanjing Huang

School of Computer Science, Fudan University

Key Laboratory of Intelligent Information Processing, Fudan University

{zcgeng20,hyan19,xpqiu,xjhuang}@fudan.edu.cn

Abstract

We present fastHan, an open-source toolkit for four basic tasks in Chinese natural language processing: Chinese word segmentation (CWS), Part-of-Speech (POS) tagging, named entity recognition (NER), and dependency parsing. The backbone of fastHan is a multi-task model based on a pruned BERT, which uses the first 8 layers of BERT. We also provide a 4-layer base model compressed from the 8-layer model. The joint model is trained and evaluated on 13 corpora of the four tasks, yielding near state-of-the-art (SOTA) performance in dependency parsing and NER, and achieving SOTA performance in CWS and POS. Besides, fastHan's transferability is also strong: it performs much better than popular segmentation tools on a non-training corpus. To better meet the needs of practical applications, we allow users to further fine-tune fastHan on their own labeled data. In addition to its small size and excellent performance, fastHan is user-friendly. Implemented as a Python package, fastHan isolates users from the internal technical details and is convenient to use. The project is released on GitHub¹.

1 Introduction

Recently, the demand for Chinese natural language processing (NLP) has increased dramatically in many downstream applications. There are four basic tasks in Chinese NLP: Chinese word segmentation (CWS), Part-of-Speech (POS) tagging, named entity recognition (NER), and dependency parsing. CWS is a character-level task, while the others are word-level tasks. These basic tasks usually serve as cornerstones of, or provide useful features for, other downstream tasks.

However, the Chinese NLP community lacks an effective toolkit that exploits the correlation between these tasks. Tools developed for a single task cannot achieve the highest accuracy, and loading a separate tool for each task takes up more memory. In practice, there is a strong correlation between the four basic Chinese NLP tasks. For example, a model performs better on the three word-level tasks if its word segmentation ability is stronger, and the results of the CWS task are contained in the output of the POS tagging task. As a result, there is much research on how to perform multi-corpus training on these tasks and how to conduct multi-task joint training. Chen et al. (2017a) adopt cross-labels to label POS so that POS tagging and CWS can be trained jointly. Yan et al. (2020) propose a graph-based model for joint CWS and dependency parsing, in which a special "APP" dependency arc is used to indicate the word segmentation information. Thus, they can jointly train the word-level dependency parsing task and the character-level CWS task with the biaffine parser (Dozat and Manning, 2016). Chen et al. (2017b) explore adversarial multi-criteria learning for CWS, showing that more knowledge can be mined by training a model on more corpora. Zhang et al. (2020) show that jointly training POS tagging and dependency parsing can improve the performance of both.

∗ Corresponding author

1 https://github.com/fastnlp/fastHan

Therefore, we developed fastHan, an efficient toolkit built with the help of multi-task learning and pre-trained models (PTMs) (Qiu et al., 2020). FastHan adopts a BERT-based (Devlin et al., 2018) joint model trained on 13 corpora to address the above four tasks. Through multi-task learning, fastHan shares knowledge among the different corpora, and this shared information improves fastHan's performance on these tasks. Besides, training on more corpora yields a larger vocabulary, which reduces the number of out-of-vocabulary characters the model encounters. What's more, the joint model greatly reduces memory usage: compared with training a separate model for each of the four tasks, it occupies roughly one quarter of the memory.

FastHan has two versions of the backbone model, base and large. The large model uses the first eight layers of BERT, and the base model uses the Theseus strategy (Xu et al., 2020) to compress the large model to four layers. FastHan also includes several optimizations to improve model performance. For example, it uses the output of POS tagging to improve the performance of the dependency parsing task, and it uses the Theseus strategy to improve the performance of the base model.

Overall, fastHan has the following advantages:

Small size: The parameters of the base model total 151MB, and those of the large model total 262MB.

High accuracy: The base version of the model achieved good results in all tasks, while the large version of the model approached SOTA in dependency parsing and NER, and achieved SOTA performance in CWS and POS.

Strong transferability: Multi-task learning allows fastHan to adapt to multiple criteria, and the large number of corpora allows fastHan to mine knowledge from rare samples. As a result, fastHan is robust to new samples. Our experiments in section 4.2 show that fastHan outperforms popular segmentation tools on a non-training dataset.

Easy to use: FastHan is implemented as a Python package, and users can get started with its basic functions in one minute. Besides, all advanced features, such as the user lexicon and fine-tuning, need only one line of code to use.

Developers of downstream applications do not need to repeat work on basic tasks or understand complex code such as BERT. Even users with little knowledge of deep learning can conveniently obtain results with SOTA performance by using fastHan. Also, the smaller size lowers hardware requirements, so fastHan can be deployed on more platforms.

For the Chinese NLP research community, the results of fastHan can be used as a unified, high-quality preprocessing standard.

Besides, the idea of fastHan is not restricted to Chinese. Applying multi-task learning to enhance NLP toolkits also has practical value in other languages.

Figure 1: Architecture of the proposed model. The inputs are character embeddings.

2 Backbone Model

The backbone of fastHan is a joint model based on BERT, which performs multi-task learning on 13 corpora of the four tasks. The architecture of the model is shown in Figure 1. A corpus tag is first added to the beginning of each sentence, whatever its task. The sentences are then fed into the BERT-based encoder and the decoding layer. The decoding layer uses different decoders according to the current task: a conditional random field (CRF) for the NER task; an MLP and a CRF for the POS tagging and CWS tasks; and the output of the POS tagging task combined with a biaffine parser for the dependency parsing task.

Each task uses an independent label set: CWS uses the label set $Y = \{B, M, E, S\}$; POS tagging uses a cross-label set based on $\{B, M, E, S\}$; NER uses a cross-label set based on $\{B, M, E, S, O\}$; dependency parsing uses arc heads and arc labels to represent the dependency grammar tree.
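To make the cross-label idea concrete, the following minimal sketch converts a segmented, POS-tagged sentence into character-level labels. The function name `to_cross_labels` and the `B-NN` string format are assumptions for illustration; the paper does not specify how fastHan encodes its cross-labels internally.

```python
def to_cross_labels(words, tags):
    """Turn (word, tag) pairs into one cross-label per character.

    Assumed label format: "<position>-<tag>", with position in {B, M, E, S}.
    """
    labels = []
    for word, tag in zip(words, tags):
        if len(word) == 1:
            positions = ["S"]  # single-character word
        else:
            # begin, zero or more middle characters, end
            positions = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        labels.extend(f"{p}-{tag}" for p in positions)
    return labels


words = ["我", "喜欢", "踢", "足球"]
tags = ["PN", "VV", "VV", "NN"]
labels = to_cross_labels(words, tags)
print(labels)
```

Dropping the tag suffix from each label recovers the plain CWS labels, which is why the CWS result is contained in the POS tagging output.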

2.1 BERT-based feature extraction layer

BERT (Devlin et al., 2018) is a language model pre-trained on large-scale corpora. The pre-trained BERT can be used to encode the input sequence. We take the output of the last layer of transformer blocks as the feature vector of the sequence. The attention mechanism (Vaswani et al., 2017) of BERT can extract rich, context-dependent semantic information. In addition, attention is computed in parallel over the entire sequence, which is faster than a feature extraction layer based on LSTM. Different from vanilla BERT, we prune its layers and add corpus tags to input sequences.

Layer Pruning: The original BERT has 12 layers of transformer blocks, which occupy a lot of memory. The time cost of computing 12 layers is too high for these basic tasks, even if data flows in parallel. Inspired by Huang et al. (2019), we use only 4 or 8 layers. Our experiments found that using the first eight layers performs well on all tasks, and after compression, four layers are enough for CWS, POS tagging, and NER.

Corpus Tags: Instead of a linear projection layer, we use corpus tags to distinguish the various tasks and corpora. Each corpus of each task corresponds to a specific corpus tag, and the embeddings of these tags are initialized and optimized during training. As shown in Figure 1, before inputting the sequence into BERT, we add the corpus tag to the head of the sequence. Through the attention mechanism, the vector of the corpus tag interacts with the vector at every other position, which brings the corpus and task information to each character.
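As a small illustration of the corpus-tag mechanism, the sketch below prepends a tag token to the character sequence before it would be passed to the encoder. The tag string format (`[CWS-pku]`) and the function name are hypothetical; the paper only states that one tag per corpus is added at the head of the sequence.

```python
def add_corpus_tag(sentence, corpus):
    """Return the character sequence with a corpus tag token at position 0.

    The "[<corpus>]" token format is an assumption made for this example.
    """
    return [f"[{corpus}]"] + list(sentence)


tokens = add_corpus_tag("南京市长江大桥", "CWS-pku")
print(tokens)
```

Because the tag is an ordinary input token with its own learned embedding, switching task or segmentation criterion only requires changing the first token, not the model.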

2.2 CRF Decoder

We use the conditional random field (CRF) (Lafferty et al., 2001) to do the final decoding work in the POS tagging, CWS, and NER tasks. In CRF, the conditional probability of a label sequence can be formalized as:

$$P(Y \mid X) = \frac{1}{Z(x;\theta)} \exp\left( \sum_{t=1}^{T} \theta_1^{\top} f_1(X, y_t) + \sum_{t=1}^{T-1} \theta_2^{\top} f_2(X, y_t, y_{t+1}) \right) \tag{1}$$

where $\theta$ are model parameters, $f_1(X, y_t)$ is the score for label $y_t$ at position $t$, $f_2(X, y_t, y_{t+1})$ is the transition score from $y_t$ to $y_{t+1}$, and $Z(x;\theta)$ is the normalization factor.

Compared with decoding using an MLP only, CRF utilizes the neighbor information. When decoding with the Viterbi algorithm, CRF finds the globally optimal label sequence instead of the label with the highest score at each position.
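The difference between per-position decoding and Viterbi decoding can be seen in a minimal pure-Python sketch. Here `emissions[t][k]` plays the role of the label score $f_1$ and `transitions[j][k]` the role of the transition score $f_2$ (with the parameters folded in); the function name and the toy numbers are illustrative assumptions, not fastHan's implementation.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label path and its score.

    emissions[t][k]: score of label k at position t.
    transitions[j][k]: score of moving from label j to label k.
    """
    n_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best_prev] + transitions[best_prev][k]
                             + emissions[t][k])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    best_last = max(range(n_labels), key=lambda k: score[k])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[best_last]


emissions = [[1.0, 0.8], [0.9, 1.0]]
transitions = [[0.5, -2.0], [-2.0, 0.5]]
path, best = viterbi(emissions, transitions)
greedy = [max(range(2), key=lambda k: row[k]) for row in emissions]
print(path, best, greedy)
```

In this toy case the per-position choice is label 0 followed by label 1, but the strongly negative transition between them makes the path that stays on label 0 the global optimum.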

2.3 Biaffine Parser with Output of POS Tagging

Our approach to this task builds on the work of Yan et al. (2020), which uses the biaffine parser to address both CWS and dependency parsing. Unlike Yan et al. (2020), our model uses the output of POS tagging, for two reasons. First, dependency parsing has a large semantic and formal gap with the other tasks, so sharing the parameter space with them reduces its performance. Our experimental results show that when the prediction of dependency parsing is independent of the other tasks, the performance is worse than that of training dependency parsing alone. With the output of POS tagging, dependency parsing obtains more useful information, such as word segmentation and POS tagging labels. Second, and more importantly, users need to obtain all the information about a sentence at once. If POS tagging and dependency parsing are run separately, the word segmentation results of the two tasks may conflict, and this contradiction cannot be resolved by engineering methods. Although this design introduces error propagation, our experiments show that the negative impact is acceptable given high POS tagging accuracy.

When predicting for dependency parsing, we first add the POS tagging corpus tag at the head of the original sentence to get the POS tagging output. Then we add the corpus tag of dependency parsing at the head of the original sentence to get the feature vectors. Next, we use the word segmentation results from POS tagging to split the dependency parsing feature vectors by token. The feature vectors of the characters in a token are averaged to represent the token. In addition, an embedding with the same dimension as the feature vector is established for the POS tagging labels. The feature vector of each token is added to the embedding vector of its POS label, and the result is input into the biaffine parser. During the training phase, the model uses gold POS tagging labels. The premise of using POS tagging output is that the corpus contains both dependency parsing and POS tagging information.
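The pooling step described above (average the character vectors of each token, then add the POS label embedding) can be sketched with plain Python lists. The function name, the toy vectors, and the two-entry embedding table are assumptions for illustration; the actual model operates on batched PyTorch tensors.

```python
def token_features(char_vectors, words, pos_ids, pos_embedding):
    """Average the character vectors of each word and add its POS embedding.

    char_vectors: one feature vector per character of the sentence.
    words: segmentation taken from the POS tagging output.
    pos_ids: index of each word's POS label in pos_embedding.
    """
    features, start = [], 0
    for word, pos_id in zip(words, pos_ids):
        span = char_vectors[start:start + len(word)]
        mean = [sum(dim) / len(span) for dim in zip(*span)]
        features.append([m + e for m, e in zip(mean, pos_embedding[pos_id])])
        start += len(word)
    return features


char_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 characters, dim 2
words = ["喜欢", "踢"]
pos_embedding = [[0.5, 0.5], [1.0, 0.0]]  # hypothetical 2-label table
feats = token_features(char_vectors, words, [0, 1], pos_embedding)
print(feats)
```

The output has one vector per word rather than per character, which is the granularity the biaffine parser needs for word-level arcs.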

2.4 Theseus Strategy

The Theseus strategy (Xu et al., 2020) is a method to compress BERT, and we use it to train the base version of the model. As shown in Figure 2, after obtaining the large version of the model, we use the module replacement strategy to train the four-layer base model. The base model is initialized with the first four layers of the large model, and its layer $i$ is bound to layers $2i-1$ and $2i$ of the large model; these are its corresponding modules. The training phase is divided into two parts. In the first part, we randomly choose, independently for each module, whether to replace the module in the base model with its corresponding module in the large model. We freeze the parameters of the large model when using gradients to update parameters. The replacement probability is initialized to 0.5 and decreases linearly to 0. In the second part, we only fine-tune the base model and no longer replace modules.

Figure 2: This diagram explains the replacement strategy when using the Theseus method. When training the base model, we randomly replace layers of the base model with the corresponding layers of the large model. The red arrows and yellow arrows represent two possible data paths during training.

Figure 3: An example of segmentation of the sequence (c1, c2, c3, ...) combined with a user lexicon. According to the segmentation result of the maximum matching algorithm, a bias will be added to the scores marked in red.
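The replacement schedule of Section 2.4 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the decay is tied to a step counter, and each module is represented only by its layer indices rather than by real network layers.

```python
import random


def replacement_probability(step, total_steps, p0=0.5):
    """Linear decay from p0 at step 0 down to 0 at total_steps."""
    return max(0.0, p0 * (1 - step / total_steps))


def choose_path(num_base_layers, p, rng):
    """Pick, independently per module, which layers the data flows through.

    With probability p, base layer i is replaced by large layers 2i-1 and 2i.
    """
    path = []
    for i in range(1, num_base_layers + 1):
        if rng.random() < p:
            path.append(("large", (2 * i - 1, 2 * i)))
        else:
            path.append(("base", (i,)))
    return path


rng = random.Random(0)
for step in (0, 50, 100):
    p = replacement_probability(step, total_steps=100)
    print(step, p, choose_path(4, p, rng))
```

Once the probability reaches 0, every module comes from the base model, which corresponds to the second training part where only the base model is fine-tuned.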

2.5 User Lexicon

In actual applications, users may process text from specific domains, such as technology or medicine. Such domains have proprietary vocabularies with high recall rates, whose words rarely appear in ordinary corpora. It is intuitive to use a user lexicon to address this problem. Users can choose whether to add or use their lexicon. An example of combining a user lexicon is shown in Figure 3. When a user lexicon is used, the maximum matching algorithm (Wong and Chan, 1996) is first performed to obtain a label sequence. After that, a bias is added to the corresponding scores output by the encoder, and the result is treated as $f_1(X, y_t)$ in the CRF of section 2.2. The bias is calculated by the following equation:

$$b_t = \left( \max(y_{1:n}) - \operatorname{average}(y_{1:n}) \right) \cdot w \tag{2}$$

where $b_t$ is the bias at position $t$, $y_{1:n}$ are the scores of the $n$ labels at position $t$ output by the encoder, and $w$ is a coefficient whose default value is 0.05. The CRF decoder then generates the globally optimal solution with the bias taken into account. Users can set the coefficient according to the recall rate of their lexicon. A development set can also be used to find the optimal coefficient.

Figure 4: The workflow of fastHan. As indicated by the yellow arrows, data is converted between various formats in each stage. The blue arrows show that fastHan needs to act according to the task currently being performed.
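The lexicon bias of Equation 2 can be sketched in a few lines. This is a minimal illustration, not fastHan's code: it uses simple forward maximum matching, assumes the label order B, M, E, S, and assumes that only characters covered by a lexicon word receive a bias; all function names are hypothetical.

```python
LABELS = ["B", "M", "E", "S"]


def forward_maximum_matching(text, lexicon):
    """Greedy longest match; uncovered characters become one-character words."""
    max_len = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append((piece, piece in lexicon))
                i += size
                break
    return words


def add_lexicon_bias(scores, text, lexicon, w=0.05):
    """Add b_t = (max(y) - average(y)) * w to the label the lexicon suggests."""
    biased = [list(row) for row in scores]
    t = 0
    for word, in_lexicon in forward_maximum_matching(text, lexicon):
        tags = ["S"] if len(word) == 1 else ["B"] + ["M"] * (len(word) - 2) + ["E"]
        for tag in tags:
            if in_lexicon:
                row = scores[t]
                bias = (max(row) - sum(row) / len(row)) * w
                biased[t][LABELS.index(tag)] += bias
            t += 1
    return biased


text = "南京市长江大桥"
scores = [[1.0, 0.0, 0.0, 3.0] for _ in text]  # toy encoder scores
biased = add_lexicon_bias(scores, text, {"长江大桥"})
print(biased[3], biased[6])
```

Because the bias scales with the gap between the best and the average score at each position, the same coefficient nudges confident and uncertain positions proportionally, and the CRF still makes the final decision.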

3 fastHan

FastHan is a Chinese NLP toolkit based on the above model, developed with fastNLP² and PyTorch. We made a short video demonstrating fastHan and uploaded it to YouTube³ and bilibili⁴.

FastHan has been released on PyPI, and users can install it with pip:

pip install fastHan

3.1 Workﬂow

When fastHan initializes, it first loads the pre-trained model parameters from the file system. Then, fastHan uses the pre-trained parameters to initialize the backbone model. FastHan will download the parameters from our server automatically if it has not been initialized in the current environment before. After initialization, fastHan's workflow is shown in Figure 4.

In the preprocessing stage, fastHan first adds a corpus tag to the head of each sentence according to the current task, and then uses the vocabulary to convert the sentences into a padded batch of vectors. FastHan is robust and does not preprocess the original sentence redundantly, for example by removing stop words or specially processing numbers and English characters.

In the parsing phase, fastHan first converts the label sequence into character form and then parses it. FastHan returns the result in a form that is readable for users.

2 https://github.com/fastnlp/fastnlp

3 https://youtu.be/apM78cG06jY

4 https://www.bilibili.com/video/BV1ho4y117H3

Figure 5: An example of using fastHan. On the left is the code entered by the user, and on the right is the corresponding output. The two sentences in the figure mean "I like playing football" and "Nanjing Yangtze River Bridge". The second sentence can also be read as "Daqiao Jiang, mayor of Nanjing city", and it is quite easy to include a user lexicon to customize the output of the second sentence.

3.2 Usage

As shown in Figure 5, fastHan is easy to use. It needs only one line of code to initialize, where users can choose to use the base or large version of the model.

When calling fastHan, users need to select the task to be performed. The information of the three tasks of CWS, POS tagging, and dependency parsing is nested: the output of each later task contains that of the earlier ones. The information of the NER task is independent of the other tasks. The input of fastHan can be a string or a list of strings. In the output of fastHan, words and their attributes are organized in the form of a list, which is convenient for subsequent processing. By setting parameters, users can also put their user lexicon into use. FastHan uses the CTB label sets for POS tagging and dependency parsing, and the MSRA label set for NER.

Besides, users can call the set_device function to change the device used by the backbone model. Using a GPU can greatly accelerate the prediction and fine-tuning of fastHan.

3.3 Advanced Features

In addition to using fastHan as an off-the-shelf model, users can utilize a user lexicon and fine-tuning to enhance the performance of fastHan. For the user lexicon, users can call the add_user_dict function to add their lexicon, and call the set_user_dict_weight function to change the weight coefficient. For fine-tuning, users can call the finetune function to load the formatted data, fine-tune the model, and save the model parameters.

Users can change the segmentation style by calling the set_cws_style function. Each CWS corpus has different granularity and coverage. By changing the corpus tag, fastHan will segment words in the style of the corresponding corpus.
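A typical session combining the functions named in Sections 3.2 and 3.3 might look like the sketch below. The function names set_device, add_user_dict, set_user_dict_weight and set_cws_style come from the text; the class name `FastHan` and the keyword arguments (`model_type`, `target`, `use_dict`) and their values are assumptions based on our recollection of the project README, and may differ between releases, so please check the documentation of the installed version. The import sits inside the function because running it requires the package and a model download.

```python
def demo():
    """Hedged usage sketch; argument names are assumptions, see lead-in."""
    from fastHan import FastHan  # requires `pip install fastHan`

    model = FastHan(model_type="base")  # assumed: "base" or "large"
    # model.set_device("cuda:0")        # optional: move the model to a GPU

    sentences = ["我爱踢足球。", "南京市长江大桥"]
    print(model(sentences, target="CWS"))      # word segmentation
    print(model(sentences, target="POS"))      # segmentation + POS tags
    print(model(sentences, target="Parsing"))  # + dependency heads and labels
    print(model(sentences, target="NER"))      # named entities

    # User lexicon: bias segmentation towards "长江大桥" (Section 2.5).
    model.add_user_dict(["长江大桥"])
    model.set_user_dict_weight(0.05)
    print(model("南京市长江大桥", target="CWS", use_dict=True))

    # Segment in the style of another CWS corpus (style name assumed).
    model.set_cws_style("cnc")
    print(model("南京市长江大桥", target="CWS"))
```

Calling `demo()` in an environment with fastHan installed should print one result list per task; the finetune function is omitted here because its data format and arguments are not described in this section.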

4 Evaluation

We evaluate fastHan in terms of accuracy, transferability, and execution speed.

4.1 Accuracy Test

The accuracy test is performed on the test sets of the corpora used for training. We use the CWS corpora of Chen et al. (2015) and Huang et al. (2019), including PKU, MSR, AS, CITYU (Emerson, 2005), CTB-6 (Xue et al., 2005), SXU (Jin and Chen, 2008), UD, CNC, WTB (Wang et al., 2014) and ZX (Zhang et al., 2014). More details can be found in Huang et al. (2019). For POS tagging and dependency parsing, we use the Penn Chinese Treebank 9.0 (CTB-9) (Xue et al., 2005). For NER, we use MSRA's NER dataset and OntoNotes.

We conduct an additional set of experiments in which the base version of fastHan is trained on each task separately. The final results are shown in Table 1. Both the base and large models perform satisfactorily. The results show that multi-task learning improves fastHan's performance on all tasks. The large version of fastHan outperforms the current best model in CWS and POS. Although fastHan's scores on NER MSRA and dependency parsing are not the best, the parameters used by fastHan are reduced by one-third due to layer pruning. FastHan's performance on NER can also be enhanced by a user lexicon with a high recall rate.

We also conduct a user lexicon experiment on each of the 10 CWS corpora. For each corpus, a word is added to the lexicon once it has appeared in the training set. Even with such a low-quality lexicon, fastHan's score increases by an average of 0.127 percentage points. It is therefore feasible to use a user lexicon to enhance fastHan's performance in specific domains.

| Model | CWS (F) | Dependency Parsing (F_udep, F_ldep) | POS (F) | NER MSRA (F) | NER OntoNotes (F) |
|---|---|---|---|---|---|
| SOTA models | 97.1 | 85.66, 81.71 | 93.15 | 96.09 | 81.82 |
| fastHan base trained separately | 97.15 | 80.2, 75.12 | 94.27 | 92.2 | 80.3 |
| fastHan base trained jointly | 97.27 | 81.22, 76.71 | 94.88 | 94.33 | 82.86 |
| fastHan large trained jointly | 97.41 | 85.52, 81.38 | 95.66 | 95.50 | 83.82 |

Table 1: Accuracy results of fastHan. The CWS score is the average over 10 corpora. When dependency parsing is trained separately, the biaffine parser uses the same architecture as Yan et al. (2020). SOTA models are the best-performing work we know of for each task. They come from Huang et al. (2019), Yan et al. (2020), Meng et al. (2019), and Li et al. (2020), in order. Li et al. (2020) uses a lexicon to enhance the model.

4.2 Transferability Test

| Segmentation Tool | Weibo Test Set |
|---|---|
| jieba | 83.58 |
| SnowNLP | 79.65 |
| THULAC | 86.65 |
| LTP-4.0 | 92.05 |
| fastHan | 93.38 |
| fastHan (fine-tuned) | 96.64 |

Table 2: Transfer test for fastHan, using the span F metric. We use the test set of Weibo, which has 8092 samples. For LTP-4.0, we use the base version, which has the best performance among their models.

For an NLP toolkit designed for the open domain, the ability to process samples not in the training corpus is very important. We perform the transfer test on Weibo (Qiu et al., 2016), which has no overlap with our training data. Samples in Weibo⁵ come from the Internet, and they are complex enough to test the model's transferability. We choose to test on CWS because nearly all Chinese NLP tools have this feature. We choose popular toolkits for comparison, including jieba⁶, THULAC⁷, SnowNLP⁸ and LTP-4.0⁹. We also perform a fine-tuning test using the training set of Weibo.

The results are shown in Table 2. As an off-the-shelf model, fastHan outperforms jieba, SnowNLP, and THULAC by a wide margin. LTP-4.0 (Che et al., 2020) is another technical route for multi-task Chinese NLP, which was released after the first release of fastHan. However, fastHan still outperforms LTP with a much smaller model (262MB versus 492MB). The results show that fastHan is robust to new samples, and the fine-tuning feature allows fastHan to adapt better to new criteria.

5 https://github.com/FudanNLP/NLPCC-WordSeg-Weibo

6 https://github.com/fxsjy/jieba

7 https://github.com/thunlp/THULAC

8 https://github.com/isnowfy/snownlp

9 https://github.com/HIT-SCIR/ltp
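For readers unfamiliar with the span F metric reported in Table 2, the following minimal sketch scores one predicted segmentation against a gold one: a word counts as correct only if both of its boundaries match. The function names are illustrative, and a corpus-level score would accumulate the span counts over all sentences before computing precision and recall; the exact evaluation script used in the paper is not described here.

```python
def to_spans(words):
    """Map a word list to the set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def span_f1(gold_words, pred_words):
    """F1 over exactly matching word spans for a single sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ["南京市", "长江大桥"]
score_a = span_f1(gold, ["南京市", "长江", "大桥"])      # one of three spans correct
score_b = span_f1(gold, ["南京", "市长", "江", "大桥"])  # no span correct
print(score_a, score_b)
```

In the first prediction one of three predicted words matches one of two gold words, giving precision 1/3, recall 1/2 and F1 0.4; the second prediction shares no span with the gold segmentation and scores 0.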

4.3 Speed Test

| Models | Dependency Parsing (CPU, GPU) | Other Tasks (CPU, GPU) |
|---|---|---|
| fastHan base | 25, 22 | 55, 111 |
| fastHan large | 14, 21 | 28, 97 |

Table 3: Speed test for fastHan. The numbers in the table represent the average number of sentences processed per second.

The speed test was performed on a personal computer with an Intel Core i5-9400F CPU and an NVIDIA GeForce GTX 1660 Ti GPU. The test was conducted on the first 800 sentences of the CTB CWS corpus, with an average of 45.2 characters per sentence and a batch size of 8.

The results are shown in Table 3. Dependency parsing runs slower, and the other tasks run at about the same speed as one another. The base model with GPU performs poorly in dependency parsing because dependency parsing requires a lot of CPU computation, and the acceleration gained from the GPU is outweighed by the cost of transferring data.

5 Conclusion

In this paper, we presented fastHan, a BERT-based toolkit for CWS, NER, POS tagging, and dependency parsing in Chinese NLP. After our optimization, fastHan has the characteristics of high accuracy, small size, strong transferability, and ease of use.

In the future, we will continue to improve fastHan with better performance, more features, and more efficient learning methods, such as meta-learning (Ke et al., 2021).

Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2020AAA0106700), National Natural Science Foundation of China (No. 62022027) and Major Scientific Research Project of Zhejiang Lab (No. 2019KD0AD01).

References

Wanxiang Che, Yunlong Feng, Libo Qin, and Ting Liu. 2020. N-ltp: A open-source neural chinese language technology platform with pretrained models. arXiv preprint arXiv:2009.11616.

Xinchi Chen, Xipeng Qiu, and Xuanjing Huang. 2017a. A feature-enriched neural model for joint chinese word segmentation and part-of-speech tagging. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 3960–3966.

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015. Long short-term memory neural networks for chinese word segmentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1197–1206.

Xinchi Chen, Zhan Shi, Xipeng Qiu, and XuanJing Huang. 2017b. Adversarial multi-criteria learning for chinese word segmentation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1193–1203.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Timothy Dozat and Christopher D Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.

Thomas Emerson. 2005. The second international chinese word segmentation bakeoff. In Proceedings of the fourth SIGHAN workshop on Chinese language Processing.

Weipeng Huang, Xingyi Cheng, Kunlong Chen, Taifeng Wang, and Wei Chu. 2019. Toward fast and accurate neural chinese word segmentation with multi-criteria learning. arXiv preprint arXiv:1903.04190.

Guangjin Jin and Xiao Chen. 2008. The fourth international chinese language processing bakeoff: Chinese word segmentation, named entity recognition and chinese pos tagging. In Proceedings of the sixth SIGHAN workshop on Chinese language processing.

Zhen Ke, Liang Shi, Songtao Sun, Erli Meng, Bin Wang, and Xipeng Qiu. 2021. Pre-training with meta learning for Chinese word segmentation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5514–5523, Online. Association for Computational Linguistics.

John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.

Xiaonan Li, Hang Yan, Xipeng Qiu, and Xuanjing Huang. 2020. FLAT: Chinese NER using flat-lattice transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6836–6842, Online. Association for Computational Linguistics.

Yuxian Meng, Wei Wu, Fei Wang, Xiaoya Li, Ping Nie, Fan Yin, Muyu Li, Qinghong Han, Xiaofei Sun, and Jiwei Li. 2019. Glyce: Glyph-vectors for chinese character representations. In Advances in Neural Information Processing Systems, pages 2746–2757.

Xipeng Qiu, Peng Qian, and Zhan Shi. 2016. Overview of the NLPCC-ICCPOL 2016 shared task: Chinese word segmentation for micro-blog texts. In Proceedings of The Fifth Conference on Natural Language Processing and Chinese Computing & The Twenty Fourth International Conference on Computer Processing of Oriental Languages.

Xipeng Qiu, TianXiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. 2020. Pre-trained models for natural language processing: A survey. SCIENCE CHINA Technological Sciences, 63(10):1872–1897.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.

William Yang Wang, Lingpeng Kong, Kathryn Mazaitis, and William Cohen. 2014. Dependency parsing for weibo: An efficient probabilistic logic programming approach. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1152–1158.

Pak-kwong Wong and Chorkin Chan. 1996. Chinese word segmentation based on maximum matching and word binding force. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, COLING '96, page 200–203, USA. Association for Computational Linguistics.

Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. 2020. Bert-of-theseus: Compressing bert by progressive module replacing. arXiv preprint arXiv:2002.02925.

Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Marta Palmer. 2005. The penn chinese treebank: Phrase structure annotation of a large corpus. Natural language engineering, 11(2):207.

Hang Yan, Xipeng Qiu, and Xuanjing Huang. 2020. A graph-based model for joint chinese word segmentation and dependency parsing. Transactions of the Association for Computational Linguistics, 8:78–92.

Meishan Zhang, Yue Zhang, Wanxiang Che, and Ting Liu. 2014. Type-supervised domain adaptation for joint segmentation and pos-tagging. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 588–597.

Yu Zhang, Zhenghua Li, Houquan Zhou, and Min Zhang. 2020. Is pos tagging necessary or even helpful for neural dependency parsing? arXiv preprint arXiv:2003.03204.


