fastHan: A BERT-based Multi-Task Toolkit for Chinese NLP

Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing: System Demonstrations, pages 99–106, August 1st - August 6th, 2021.
©2021 Association for Computational Linguistics
Zhichao Geng, Hang Yan, Xipeng Qiu, Xuanjing Huang
School of Computer Science, Fudan University
Key Laboratory of Intelligent Information Processing, Fudan University
{zcgeng20,hyan19,xpqiu,xjhuang}@fudan.edu.cn
Abstract
We present fastHan, an open-source toolkit
for four basic tasks in Chinese natural lan-
guage processing: Chinese word segmenta-
tion (CWS), Part-of-Speech (POS) tagging,
named entity recognition (NER), and depen-
dency parsing. The backbone of fastHan is
a multi-task model based on a pruned BERT,
which uses the first 8 layers in BERT. We
also provide a 4-layer base model compressed
from the 8-layer model. The joint-model is
trained and evaluated on 13 corpora of the four
tasks, yielding near state-of-the-art (SOTA)
performance in dependency parsing and NER
and achieving SOTA performance in CWS and
POS tagging. fastHan also transfers well,
performing much better than popular
segmentation tools on a corpus not used in training.
To better meet the needs of practical applications,
we allow users to further fine-tune fastHan on
their own labeled data. In addition
to its small size and excellent performance,
fastHan is user-friendly. Implemented as a
Python package, fastHan isolates users from
the internal technical details and is convenient
to use. The project is released on GitHub (https://github.com/fastnlp/fastHan).
1 Introduction
Recently, the demand for Chinese natural language
processing (NLP) has increased dramatically in many
downstream applications. There are four basic
tasks for Chinese NLP: Chinese word segmentation
(CWS), Part-of-Speech (POS) tagging, named
entity recognition (NER), and dependency parsing.
CWS is a character-level task, while the others are
word-level tasks. These basic tasks are usually the
cornerstones of, or provide useful features for, other
downstream tasks.
Corresponding author
However, the Chinese NLP community lacks an
effective toolkit that exploits the correlation between
the tasks. Tools developed for a single task cannot
achieve the highest accuracy, and loading a separate tool for
each task takes up more memory. In practice,
there is a strong correlation between these four basic
Chinese NLP tasks. For example, a model
performs better on the three word-level tasks if
its word segmentation ability is stronger. Recently,
Chen et al. (2017a) adopt cross-labels for
POS so that POS tagging and CWS can be trained
jointly. Yan et al. (2020) propose a graph-based
model for joint CWS and dependency parsing, in
which a special "APP" dependency arc is used to
indicate the word segmentation information. Thus,
they can jointly train the word-level dependency
parsing task and the character-level CWS task with the
biaffine parser (Dozat and Manning, 2016). Chen
et al. (2017b) explore adversarial multi-criteria
learning for CWS, showing that more knowledge can be
mined by training a model on more corpora. More
broadly, there is a growing body of research on how
to perform multi-corpus training on these tasks and
how to conduct multi-task joint training. For instance, Zhang
et al. (2020) show that jointly training POS tagging
and dependency parsing can improve the
performance of both. Moreover, the results of the CWS task
are contained in the output of the POS tagging task.
Therefore, we developed fastHan, an efficient
toolkit built on multi-task learning and pre-trained
models (PTMs) (Qiu et al., 2020). FastHan
trains a BERT-based (Devlin et al., 2018) joint-model
on 13 corpora to address the above four tasks.
Through multi-task learning, fastHan shares knowledge
among the different corpora. This shared
information can improve fastHan's performance on
these tasks. Besides, training on more corpora
yields a larger vocabulary, which reduces how often
the model encounters out-of-vocabulary
characters. Moreover, the joint-model
greatly reduces the occupied memory space.
Compared with training a separate model for each task, the
joint-model reduces the occupied memory space
by a factor of four.
FastHan has two versions of the backbone model,
base and large. The large model uses the first eight
layers of BERT, and the base model is obtained by
compressing the large model to four layers with the
Theseus strategy (Xu et al., 2020). FastHan also
includes several optimizations to improve performance,
such as using the output of POS tagging to
improve dependency parsing and using the Theseus
strategy to improve the base model.
Overall, fastHan has the following advantages:
Small size: The parameters of the base model total 151MB,
and those of the large model total 262MB.
High accuracy: The base version of the model
achieves good results on all tasks, while the
large version approaches SOTA
in dependency parsing and NER and achieves
SOTA performance in CWS and POS tagging.
Strong transferability: Multi-task learning allows
fastHan to adapt to multiple criteria, and
the large number of corpora allows fastHan to
mine knowledge from rare samples. As a result,
fastHan is robust to new samples. Our
experiments in Section 4.2 show that fastHan
outperforms popular segmentation tools on a
dataset not used in training.
Easy to use: FastHan is implemented as a Python
package, and users can get started with its
basic functions in one minute. Besides, each
advanced feature, such as the user lexicon and
fine-tuning, needs only one line of code.
Developers of downstream applications
do not need to repeat work on these basic tasks
or to understand complex code such as
BERT. Even users with little knowledge of deep
learning can conveniently obtain results of
SOTA quality with fastHan. Also,
the smaller size lowers hardware requirements,
so fastHan can be deployed on more platforms.
For the Chinese NLP research community, the
output of fastHan can serve as a unified,
high-quality preprocessing standard.
Besides, the idea of fastHan is not restricted to
Chinese. Applying multi-task learning to enhance
NLP toolkits also has practical value in other lan-
guages.
Figure 1: Architecture of the proposed model. The inputs
are character embeddings.
2 Backbone Model
The backbone of fastHan is a joint-model based
on BERT, which performs multi-task learning on
13 corpora of the four tasks. The architecture of
the model is shown in Figure 1. A corpus tag is
first added to the beginning of each input sentence.
The sentence is then fed into the BERT-based
encoder and the decoding layer. The decoding layer
uses a different decoder depending on the current
task: a conditional random field (CRF)
for NER; an MLP followed by a CRF
for POS tagging and CWS; and the output of
POS tagging combined with a biaffine parser
for dependency parsing.
Each task uses an independent label set: CWS
uses the label set $Y = \{B, M, E, S\}$; POS tagging
uses a cross-label set based on $\{B, M, E, S\}$; NER
uses a cross-label set based on $\{B, M, E, S, O\}$;
dependency parsing uses arc heads and arc labels
to represent the dependency grammar tree.
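To make the cross-label idea concrete, the following minimal sketch converts a segmented, POS-tagged sentence into per-character labels and back. The `B-NN` style label format, the function names, and the example sentence are assumptions made for illustration; the paper does not specify fastHan's internal label encoding.

```python
def to_cross_labels(words, pos_tags):
    """Turn words and their POS tags into one cross-label per character."""
    if len(words) != len(pos_tags):
        raise ValueError("words and pos_tags must have the same length")
    labels = []
    for word, pos in zip(words, pos_tags):
        if len(word) == 1:
            labels.append(f"S-{pos}")          # single-character word
        else:
            labels.append(f"B-{pos}")          # first character
            labels.extend(f"M-{pos}" for _ in word[1:-1])  # middle characters
            labels.append(f"E-{pos}")          # last character
    return labels


def from_cross_labels(chars, labels):
    """Recover (word, pos) pairs from characters and their cross-labels."""
    result, current = [], ""
    for ch, label in zip(chars, labels):
        position, pos = label.split("-", 1)
        current += ch
        if position in ("E", "S"):             # a word ends here
            result.append((current, pos))
            current = ""
    if current:                                # tolerate a truncated final word
        result.append((current, labels[-1].split("-", 1)[1]))
    return result


words = ["我", "喜欢", "踢", "足球"]            # hypothetical example sentence
tags = ["PN", "VV", "VV", "NN"]
labels = to_cross_labels(words, tags)
print(labels)
print(from_cross_labels("".join(words), labels))
```

Because the segmentation boundary and the POS tag are carried by the same label, a single sequence labeling pass yields both the CWS and the POS tagging result.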
2.1 BERT-based feature extraction layer
BERT (Devlin et al., 2018) is a language model
pre-trained on large-scale corpora. The pre-trained
BERT can be used to encode the input sequence.
We take the output of the last layer of transformer
blocks as the feature vector of the sequence. The
attention mechanism (Vaswani et al., 2017) of BERT
can extract rich semantic information related to
the context. In addition, attention is computed
in parallel over the entire sequence, which is faster
than an LSTM-based feature extraction layer.
Unlike vanilla BERT, we prune its layers
and add corpus tags to the input sequences.
Layer Pruning: The original BERT has 12 layers
of transformer blocks, which occupy a lot
of memory. Computing all
12 layers is also too costly for these basic tasks, even
if data flows in parallel. Inspired by Huang et al.
(2019), we use only 4 or 8 layers. Our experiments
show that the first eight layers perform well
on all tasks, and that after compression, four layers are
enough for CWS, POS tagging, and NER.
Corpus Tags: Instead of a linear projection layer,
we use corpus tags to distinguish tasks and
corpora. Each corpus of each task corresponds to
a specific corpus tag, whose embedding
is initialized and optimized during
training. As shown in Figure 1, we add the corpus tag to
the head of the sequence before feeding it into BERT.
Through the attention mechanism, the vector of the
corpus tag interacts with the vector at every other
position, bringing the corpus and task
information to each character.
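The corpus-tag mechanism can be sketched as an input-preparation step: the tag is one extra token placed before the characters, and it gets its own entry in the vocabulary (and hence its own trainable embedding). The tag spelling `[CWS-pku]`, the special tokens, and the helper names below are hypothetical; they only illustrate the idea and are not fastHan's actual vocabulary.

```python
PAD, UNK = "[PAD]", "[UNK]"


def add_corpus_tag(sentence, task, corpus):
    """Prepend a task/corpus tag token to the characters of a sentence."""
    return [f"[{task}-{corpus}]"] + list(sentence)


def build_vocab(token_sequences):
    """Assign an integer id to every token, including the corpus tags."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in token_sequences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_batch(token_sequences, vocab):
    """Map tokens to ids and pad every sequence to the longest one."""
    max_len = max(len(tokens) for tokens in token_sequences)
    batch = []
    for tokens in token_sequences:
        ids = [vocab.get(token, vocab[UNK]) for token in tokens]
        batch.append(ids + [vocab[PAD]] * (max_len - len(ids)))
    return batch


sentences = ["我喜欢踢足球", "南京市长江大桥"]   # hypothetical inputs
tagged = [add_corpus_tag(s, "CWS", "pku") for s in sentences]
vocab = build_vocab(tagged)
batch = encode_batch(tagged, vocab)
print(tagged[1])
print(batch)
```

Switching the task or the annotation criterion then amounts to changing the first token of the input, with no change to the encoder itself.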
2.2 CRF Decoder
We use the conditional random field (CRF) (Lafferty
et al., 2001) for the final decoding in the
POS tagging, CWS, and NER tasks. In a CRF, the
conditional probability of a label sequence can be
formalized as:
$$P(Y|X) = \frac{1}{Z(X;\theta)} \exp\left( \sum_{t=1}^{T} \theta_1^{\top} f_1(X, y_t) + \sum_{t=1}^{T-1} \theta_2^{\top} f_2(X, y_t, y_{t+1}) \right) \tag{1}$$
where $\theta$ are model parameters, $f_1(X, y_t)$ is the
score for label $y_t$ at position $t$, $f_2(X, y_t, y_{t+1})$ is
the transition score from $y_t$ to $y_{t+1}$, and $Z(X;\theta)$ is
the normalization factor.
Compared with decoding using MLP only, CRF
utilizes the neighbor information. When decod-
ing using the Viterbi algorithm, CRF can get the
global optimal solution instead of the label with the
highest score for each position.
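The difference between per-position decoding and Viterbi decoding can be illustrated with a small pure-Python sketch. It treats the two score terms of Equation (1) as plain numbers: `emissions[t][j]` stands for the label score at position `t` and `transitions[i][j]` for the transition score. The concrete scores and the large negative penalty for illegal transitions are made up for this example.

```python
LABELS = ["B", "M", "E", "S"]
NEG = -1e4  # penalty standing in for a forbidden transition

# Transitions that yield a valid segmentation, e.g. B must be followed by M or E.
ALLOWED = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
transitions = [[0.0 if (a, b) in ALLOWED else NEG for b in LABELS] for a in LABELS]


def viterbi(emissions, transitions):
    """Return the highest-scoring label path and its total score."""
    n_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label for ending at label j on position t.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backpointers.append(pointers)
    last = max(range(n_labels), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


emissions = [[2.0, 0.0, 0.0, 1.5],    # position 0: scores for B, M, E, S
             [1.0, 0.0, 0.75, 0.5]]   # position 1
greedy = [max(range(len(LABELS)), key=lambda j: row[j]) for row in emissions]
path, best = viterbi(emissions, transitions)
print([LABELS[j] for j in greedy])    # per-position argmax
print([LABELS[j] for j in path], best)
```

In this toy input the per-position argmax picks `B` twice, which is not a valid segmentation, whereas the Viterbi path `B`, `E` is globally optimal under the transition scores.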
2.3 Biaffine Parser with Output of POS tagging
Our dependency parser follows the work of Yan et al. (2020),
which uses the biaffine parser to address both
CWS and dependency parsing. Unlike
Yan et al. (2020), our model
uses the output of POS tagging, for two reasons.
First, dependency parsing has a large semantic and
formal gap with the other tasks. As a result, sharing
the parameter space with other tasks reduces its
performance. Our experimental results show that
when the prediction of dependency parsing is
independent of the other tasks, the performance is worse
than that of training dependency parsing alone. With
the output of POS tagging, dependency parsing
obtains more useful information, such as word
segmentation and POS tagging labels. Second, and more importantly,
users need to obtain all information about a
sentence at once. If POS tagging and dependency
parsing are run separately, their word segmentation results
may conflict, and this contradiction
cannot be resolved by engineering methods. Although
this design introduces error propagation, our experiments
show that the negative impact is acceptable given the
high POS tagging accuracy.
When predicting for dependency parsing, we
first add the POS tagging corpus tag at the head of
the original sentence to get the POS tagging output.
Then we add the dependency parsing corpus tag
at the head of the original sentence to get the feature
vectors. Next, we use the word segmentation results
from POS tagging to split the feature vectors of
dependency parsing by token. The feature vectors
of the characters in a token are averaged to represent
the token. In addition, an embedding is established for
POS tagging labels, with the same dimension as the
feature vector. The feature vector of each token is
added to the embedding vector position-wise, and the
result is input into the biaffine parser. During the
training phase, the model uses gold POS tagging
labels. Using the POS tagging output requires
that the corpus contain both dependency parsing
and POS tagging information.
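The token representation described above (average the character vectors of each word, then add the POS embedding) can be sketched with plain Python lists. The vectors, the two-dimensional feature size, and the function name are illustrative assumptions; in the real model these would be tensors produced by the encoder.

```python
def token_features(char_vectors, words, pos_tags, pos_embedding):
    """Average character vectors per word, then add the word's POS embedding."""
    if sum(len(w) for w in words) != len(char_vectors):
        raise ValueError("segmentation does not cover the character vectors")
    features, start = [], 0
    for word, pos in zip(words, pos_tags):
        span = char_vectors[start:start + len(word)]
        start += len(word)
        dim = len(span[0])
        # Mean of the character vectors inside this token.
        mean = [sum(vec[d] for vec in span) / len(span) for d in range(dim)]
        # Element-wise sum with the POS label embedding of the same dimension.
        features.append([m + e for m, e in zip(mean, pos_embedding[pos])])
    return features


char_vectors = [[1.0, 2.0], [2.0, 4.0], [4.0, 0.0]]   # one vector per character
words = ["我", "喜欢"]                                  # segmentation from POS tagging
pos_tags = ["PN", "VV"]
pos_embedding = {"PN": [0.5, 0.5], "VV": [-1.0, 1.0]}  # hypothetical embeddings
feats = token_features(char_vectors, words, pos_tags, pos_embedding)
print(feats)
```

The resulting list has one vector per word rather than per character, which is the granularity the biaffine parser needs to predict word-level arcs.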
2.4 Theseus Strategy
The Theseus strategy (Xu et al., 2020) is a method for
compressing BERT, and we use it to train the base
version of the model. As shown in Figure 2, after
obtaining the large version of the model, we use the
module replacement strategy to train the four-layer
base model. The base model is initialized with the
first four layers of the large model, and its layer $i$ is
bound to layers $2i-1$ and $2i$ of the large model;
these are its corresponding modules. The training
phase is divided into two parts. In the first part, we
randomly choose, for each module, whether to replace the module
in the base model with its corresponding module
in the large model. The parameters of the large
model are frozen during gradient updates.
The replacement probability $p$ is initialized to 0.5
and decreases linearly to 0. In the second part, we
only fine-tune the base model and no longer replace
modules.
Figure 2: The replacement strategy of the Theseus method.
When training the base model, we randomly replace layers
of the base model with the corresponding layers of the large
model. The red arrows and yellow arrows represent two possible
data paths during training.
Figure 3: An example of segmenting the sequence
$(c_1, c_2, c_3, \ldots)$ combined with a user lexicon. According
to the segmentation result of the maximum matching
algorithm, a bias is added to the scores marked in
red.
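The layer binding and the replacement schedule of the first training part can be sketched as follows. Only the mapping from base layer $i$ to large layers $2i-1$ and $2i$, the initial probability 0.5, and the linear decay to 0 come from the text; the function names and the number of steps are assumptions for illustration.

```python
import random


def large_layers_for(i):
    """Base layer i (1-indexed) is bound to large layers 2i-1 and 2i."""
    return [2 * i - 1, 2 * i]


def replacement_prob(step, total_steps, p0=0.5):
    """Probability of using the large module, decaying linearly from p0 to 0."""
    return max(0.0, p0 * (1 - step / total_steps))


def sample_path(n_base_layers, p, rng):
    """For each module, pick the frozen large module with probability p."""
    path = []
    for i in range(1, n_base_layers + 1):
        if rng.random() < p:
            path.append(("large", large_layers_for(i)))
        else:
            path.append(("base", [i]))
    return path


rng = random.Random(0)
for step in (0, 50, 100):
    p = replacement_prob(step, total_steps=100)
    print(step, p, sample_path(4, p, rng))
```

Once the probability reaches 0, every sampled path runs through the base layers only, which corresponds to the second part of training where the base model is fine-tuned on its own.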
2.5 User Lexicon
In actual applications, users may process text from
specific domains, such as technology or medicine.
Such domains have proprietary vocabularies with high
recall rates, whose words rarely appear
in ordinary corpora. It is intuitive to use a user
lexicon to address this problem. Users can choose
whether to add or use their lexicon. An example
of combining a user lexicon is shown in Figure 3.
When a user lexicon is used, the maximum
matching algorithm (Wong and Chan, 1996) is first
performed to obtain a label sequence. After that, a
bias is added to the corresponding scores
output by the encoder, and the result is treated
as $f_1(X, y_t)$ in the CRF of Section 2.2. The bias is
calculated by the following equation:
$$b_t = \left(\max(y_{1:n}) - \mathrm{average}(y_{1:n})\right) \times w \tag{2}$$
where $b_t$ is the bias at position $t$, $y_{1:n}$ are the scores
of all labels at position $t$ output by the encoder,
and $w$ is a coefficient whose default value is 0.05.
The CRF decoder then generates the globally optimal
solution taking the bias into account. Users can set the
coefficient according to the recall rate of their
lexicon. A development set can also be used to
find the optimal coefficient.
Figure 4: The workflow of fastHan. As indicated by the
yellow arrows, data is converted between various formats
in each stage. The blue arrows indicate that fastHan
acts according to the task currently being
performed.
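The lexicon bias of Equation (2) can be sketched in two steps: a maximum matching pass that proposes a label for every character covered by a lexicon word, and a score update that adds $b_t$ to the proposed label. The sketch assumes forward maximum matching with a fixed maximum word length and a `B/M/E/S` label order; the paper does not state these details, so they are illustrative choices.

```python
LABELS = ["B", "M", "E", "S"]


def max_match_labels(text, lexicon, max_len=4):
    """Forward maximum matching; None marks characters not covered by the lexicon."""
    labels = [None] * len(text)
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in lexicon:
                if size == 1:
                    labels[i] = "S"
                else:
                    labels[i] = "B"
                    for k in range(i + 1, i + size - 1):
                        labels[k] = "M"
                    labels[i + size - 1] = "E"
                i += size
                break
        else:
            i += 1  # no lexicon word starts here
    return labels


def add_lexicon_bias(scores, lexicon_labels, w=0.05):
    """Add b_t = (max(y) - average(y)) * w to the score of the lexicon label."""
    biased = [list(row) for row in scores]
    for t, label in enumerate(lexicon_labels):
        if label is None:
            continue
        row = scores[t]
        bias = (max(row) - sum(row) / len(row)) * w
        biased[t][LABELS.index(label)] += bias
    return biased


lex_labels = max_match_labels("南京市长江大桥", {"长江大桥"})
print(lex_labels)
print(add_lexicon_bias([[1.0, 0.0, 2.0, 1.0]], ["B"], w=0.5))
```

Because the bias is proportional to the gap between the best and the average score at that position, a larger `w` lets the lexicon override the encoder more often, which is why the coefficient should reflect how reliable the lexicon is.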
3 fastHan
FastHan is a Chinese NLP toolkit based on the
above model, developed with fastNLP (https://github.com/fastnlp/fastnlp)
and PyTorch. We made a short video demonstrating
fastHan and uploaded it to YouTube (https://youtu.be/apM78cG06jY)
and bilibili (https://www.bilibili.com/video/BV1ho4y117H3).
FastHan has been released on PyPI, and users
can install it with pip:
pip install fastHan
3.1 Workflow
When fastHan initializes, it first loads the pre-trained
model parameters from the file system
and uses them to
initialize the backbone model. FastHan downloads
the parameters from our server automatically if it
has not been initialized in the current environment
before. After initialization, fastHan's workflow is
shown in Figure 4.
In the preprocessing stage, fastHan first adds a
corpus tag to the head of each sentence according
to the current task and then uses the vocabulary
to convert the sentences into a padded batch of vectors.
FastHan is robust and does not
preprocess the original sentence redundantly, for example
by removing stop words or processing numbers and
English characters.
In the parsing phase, fastHan first converts the
label sequence into character form and then parses
it. FastHan returns the result in a form that is
readable for users.
Figure 5: An example of using fastHan. On the left is the code entered by the user, and on the right is the
corresponding output. The two sentences in the figure mean "I like playing football" and "Nanjing Yangtze River
Bridge". The second sentence can also be read as "Daqiao Jiang, mayor of the Nanjing city", and
it is quite easy to include a user lexicon to customize the output of the second sentence.
3.2 Usage
As shown in Figure 5, fastHan is easy to use. It
only needs one line of code to initialize, where
users can choose to use the base or large version of
the model.
When calling fastHan, users need to select the
task to be performed. The outputs of the three
tasks of CWS, POS tagging, and dependency parsing are in
an inclusive relationship: the output of POS tagging contains
that of CWS, and the output of dependency parsing contains
that of POS tagging. The information of
the NER task is independent of the other tasks. The
input of fastHan can be a string or a list of strings.
In the output of fastHan, words and their attributes
are organized as a list, which is convenient
for subsequent processing. By setting parameters,
users can also put their user lexicon into use.
FastHan uses the CTB label sets for POS tagging and
dependency parsing and the MSRA label
set for NER.
Besides, users can call the set_device function
to change the device used by the backbone
model. Using a GPU can greatly accelerate the
prediction and fine-tuning of fastHan.
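The usage described in this section might look like the sketch below. The class name `FastHan`, the `model_type` argument, and the `target` values "CWS", "POS", "NER" and "Parsing" follow the project README as far as we recall and should be treated as assumptions that may differ between releases; the Chinese example sentences are our own renderings of the two sentences described in Figure 5. The import is placed inside the function so that the file can be loaded even when the package is not installed.

```python
def fasthan_demo(model_type="base", device=None):
    """Run the four tasks on two example sentences (requires `pip install fastHan`)."""
    # Imported lazily; the names below are assumptions based on the project README.
    from fastHan import FastHan

    model = FastHan(model_type=model_type)   # one line to initialize: "base" or "large"
    if device is not None:
        model.set_device(device)             # e.g. "cuda:0" to run on a GPU

    sentences = ["我喜欢踢足球。", "南京市长江大桥"]  # assumed renderings of Figure 5
    results = {}
    for target in ("CWS", "POS", "NER", "Parsing"):
        # The input can be a single string or a list of strings.
        results[target] = model(sentences, target=target)
    return results


if __name__ == "__main__":
    for task, output in fasthan_demo().items():
        print(task, output)
```

On first use the model parameters are downloaded automatically, so the first call can take noticeably longer than later ones.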
3.3 Advanced Features
In addition to using fastHan as an off-the-shelf
model, users can utilize a user lexicon and
fine-tuning to enhance the performance of fastHan.
For the user lexicon, users can call the add_user_dict
function to add their lexicon and the
set_user_dict_weight function to change the
weight coefficient. For fine-tuning, users can
call the finetune function to load the formatted
data, fine-tune the model, and save the model
parameters.
Users can change the segmentation style by calling
the set_cws_style function. Each CWS corpus
has a different granularity and coverage. By changing
the corpus tag, fastHan segments words in
the style of the corresponding corpus.
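A hedged sketch of the lexicon and style features follows. The method names come from the text above; the argument forms (a list of words for `add_user_dict`, the `use_dict=True` flag, the style name "cnc") follow the project README as far as we recall and are assumptions that may not match every release. The `finetune` call is omitted because its parameters are not described here.

```python
def fasthan_lexicon_demo(sentence="南京市长江大桥", lexicon=("江大桥",), weight=0.05):
    """Segment a sentence with and without a user lexicon (requires fastHan)."""
    # Imported lazily; signatures below are assumptions based on the project README.
    from fastHan import FastHan

    model = FastHan()
    plain = model(sentence, target="CWS")

    model.add_user_dict(list(lexicon))       # register the user lexicon
    model.set_user_dict_weight(weight)       # coefficient w of Equation (2)
    with_lexicon = model(sentence, target="CWS", use_dict=True)

    model.set_cws_style("cnc")               # segment in the style of another corpus
    other_style = model(sentence, target="CWS")
    return plain, with_lexicon, other_style


if __name__ == "__main__":
    for output in fasthan_lexicon_demo():
        print(output)
```

With a name such as the one in the example lexicon, the second call may switch to the alternative reading of the sentence mentioned in the Figure 5 caption, depending on the weight.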
4 Evaluation
We evaluate fastHan in terms of accuracy, transfer-
ability, and execution speed.
4.1 Accuracy Test
The accuracy test is performed on the test sets of
the training corpora. We use the CWS corpora used by
Chen et al. (2015) and Huang et al. (2019), including
PKU, MSR, AS, CITYU (Emerson, 2005), CTB-6
(Xue et al., 2005), SXU (Jin and Chen, 2008), UD,
CNC, WTB (Wang et al., 2014), and ZX (Zhang
et al., 2014). More details can be found in Huang
et al. (2019). For POS tagging and dependency
parsing, we use the Penn Chinese Treebank 9.0
(CTB-9) (Xue et al., 2005). For NER, we use
MSRA's NER dataset and OntoNotes.
We conduct an additional set of experiments in
which the base version of fastHan is trained on each
task separately. The final results are shown in
Table 1. Both the base and large models perform
satisfactorily. The results show that multi-task
learning improves fastHan's performance on all
tasks. The large version of fastHan outperforms
the current best model in CWS and POS tagging. Although
fastHan's score on NER and dependency parsing
is not the best, fastHan uses one-third fewer
parameters owing to layer pruning. FastHan's
performance on NER can also be enhanced by a
user lexicon with a high recall rate.
We also conduct a user lexicon experiment on each of
the 10 CWS corpora. For each corpus,
a word is added to the lexicon once it has
appeared in the training set. Even with such a low-quality
lexicon, fastHan's score increases by an average
of 0.127 percentage points. It is therefore feasible to use
a user lexicon to enhance fastHan's performance in
specific domains.
Model                             CWS (F)   Dependency Parsing (F_udep, F_ldep)   POS (F)   NER MSRA (F)   NER OntoNotes (F)
SOTA models                       97.1      85.66, 81.71                          93.15     96.09          81.82
fastHan base trained separately   97.15     80.2, 75.12                           94.27     92.2           80.3
fastHan base trained jointly      97.27     81.22, 76.71                          94.88     94.33          82.86
fastHan large trained jointly     97.41     85.52, 81.38                          95.66     95.50          83.82
Table 1: Accuracy results of fastHan. The CWS score is the average over 10 corpora. When dependency parsing
is trained separately, the biaffine parser uses the same architecture as Yan et al. (2020). SOTA models are the
best-performing work we know of for each task; in order, they come from Huang et al. (2019), Yan et al. (2020), Meng et al.
(2019), and Li et al. (2020). Li et al. (2020) use a lexicon to enhance the model.
4.2 Transferability Test
Segmentation Tool      Weibo Test Set
jieba                  83.58
SnowNLP                79.65
THULAC                 86.65
LTP-4.0                92.05
fastHan                93.38
fastHan (fine-tuned)   96.64
Table 2: Transfer test for fastHan, using the span F metric.
We use the Weibo test set, which has 8092 samples.
For LTP-4.0, we use the base version, which has the
best performance among their models.
For an NLP toolkit designed for the open domain,
the ability to process samples not in the
training corpus is very important. We perform the
transfer test on Weibo (Qiu et al., 2016)
(https://github.com/FudanNLP/NLPCC-WordSeg-Weibo), which
has no overlap with our training data. Samples
in Weibo come from the Internet, and they are
complex enough to test the model's transferability.
We choose to test on CWS because nearly all
Chinese NLP tools have this feature. We compare against
popular toolkits, including Jieba (https://github.com/fxsjy/jieba),
THULAC (https://github.com/thunlp/THULAC),
SnowNLP (https://github.com/isnowfy/snownlp),
and LTP-4.0 (https://github.com/HIT-SCIR/ltp). We also
test fine-tuning using the training set of
Weibo.
The results are shown in Table 2. As an off-the-shelf
model, fastHan outperforms jieba, SnowNLP,
and THULAC by 6.7 to 13.7 points. LTP-4.0 (Che et al., 2020)
follows another technical route for multi-task Chinese NLP
and was released after the first release of fastHan.
Nevertheless, fastHan still outperforms LTP with a
much smaller model (262MB versus 492MB). The
results show that fastHan is robust to new samples, and
that the fine-tuning feature allows fastHan to adapt
better to new criteria.
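For readers unfamiliar with the span F metric used in Table 2, the sketch below computes it in the usual way: each word is turned into a (start, end) character span, and precision and recall are measured over spans that match exactly. This is a generic implementation written for illustration, not the evaluation script used in the paper, and the example segmentations are made up.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def span_f1(gold_sentences, pred_sentences):
    """Micro-averaged F1 over exactly matching word spans."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["我", "喜欢", "踢", "足球"]]
pred = [["我", "喜欢", "踢足球"]]       # the last two words are merged
score = span_f1(gold, pred)
print(round(score, 4))
```

A single merged or split word costs both precision and recall, since every span touching the error stops matching; here two of three predicted spans and two of four gold spans agree.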
4.3 Speed Test
Model            Dependency Parsing (CPU, GPU)   Other Tasks (CPU, GPU)
fastHan base     25, 22                          55, 111
fastHan large    14, 21                          28, 97
Table 3: Speed test for fastHan. The numbers in the
table represent the average number of sentences
processed per second.
The speed test was performed on a personal
computer with an Intel Core i5-9400F CPU and an
NVIDIA GeForce GTX 1660 Ti GPU. The test was
conducted on the first 800 sentences of the CTB CWS
corpus, with an average of 45.2 characters per
sentence and a batch size of 8.
The results are shown in Table 3. Dependency
parsing runs slower than the other tasks, which run at about
the same speed as one another. The base model gains nothing
from the GPU in dependency parsing (22 versus 25 sentences
per second) because dependency parsing requires a lot of
CPU computation, and the GPU speedup is outweighed by the
cost of data transfer.
5 Conclusion
In this paper, we presented fastHan, a BERT-based
toolkit for CWS, NER, POS tagging, and dependency
parsing in Chinese NLP. After our optimization,
fastHan features high accuracy,
small size, strong transferability, and ease of use.
In the future, we will continue to improve
fastHan with better performance, more features,
and more efficient learning methods, such as
meta-learning (Ke et al., 2021).
Acknowledgements
This work was supported by the National Key
Research and Development Program of China
(No. 2020AAA0106700), National Natural Sci-
ence Foundation of China (No. 62022027) and
Major Scientific Research Project of Zhejiang Lab
(No. 2019KD0AD01).
References
Wanxiang Che, Yunlong Feng, Libo Qin, and Ting Liu.
2020. N-ltp: A open-source neural chinese language
technology platform with pretrained models. arXiv
preprint arXiv:2009.11616.
Xinchi Chen, Xipeng Qiu, and Xuanjing Huang.
2017a. A feature-enriched neural model for joint
chinese word segmentation and part-of-speech tag-
ging. In Proceedings of the Twenty-Sixth Inter-
national Joint Conference on Artificial Intelligence,
IJCAI-17, pages 3960–3966.
Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu,
and Xuanjing Huang. 2015. Long short-term mem-
ory neural networks for chinese word segmenta-
tion. In Proceedings of the Conference on Empiri-
cal Methods in Natural Language Processing, pages
1197–1206.
Xinchi Chen, Zhan Shi, Xipeng Qiu, and XuanJing
Huang. 2017b. Adversarial multi-criteria learning
for chinese word segmentation. In Proceedings
of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 1193–1203.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2018. Bert: Pre-training of deep
bidirectional transformers for language understand-
ing. arXiv preprint arXiv:1810.04805.
Timothy Dozat and Christopher D Manning. 2016.
Deep biaffine attention for neural dependency pars-
ing. arXiv preprint arXiv:1611.01734.
Thomas Emerson. 2005. The second international chi-
nese word segmentation bakeoff. In Proceedings of
the fourth SIGHAN workshop on Chinese language
Processing.
Weipeng Huang, Xingyi Cheng, Kunlong Chen,
Taifeng Wang, and Wei Chu. 2019. Toward
fast and accurate neural chinese word segmenta-
tion with multi-criteria learning. arXiv preprint
arXiv:1903.04190.
Guangjin Jin and Xiao Chen. 2008. The fourth inter-
national chinese language processing bakeoff: Chi-
nese word segmentation, named entity recognition
and chinese pos tagging. In Proceedings of the sixth
SIGHAN workshop on Chinese language process-
ing.
Zhen Ke, Liang Shi, Songtao Sun, Erli Meng, Bin
Wang, and Xipeng Qiu. 2021. Pre-training with
meta learning for Chinese word segmentation. In
Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
pages 5514–5523, Online. Association for Compu-
tational Linguistics.
John Lafferty, Andrew McCallum, and Fernando CN
Pereira. 2001. Conditional random fields: Prob-
abilistic models for segmenting and labeling se-
quence data. In ICML.
Xiaonan Li, Hang Yan, Xipeng Qiu, and Xuanjing
Huang. 2020. FLAT: Chinese NER using flat-lattice
transformer. In Proceedings of the 58th Annual
Meeting of the Association for Computational Lin-
guistics, pages 6836–6842, Online. Association for
Computational Linguistics.
Yuxian Meng, Wei Wu, Fei Wang, Xiaoya Li, Ping Nie,
Fan Yin, Muyu Li, Qinghong Han, Xiaofei Sun, and
Jiwei Li. 2019. Glyce: Glyph-vectors for chinese
character representations. In Advances in Neural In-
formation Processing Systems, pages 2746–2757.
Xipeng Qiu, Peng Qian, and Zhan Shi. 2016. Overview
of the NLPCC-ICCPOL 2016 shared task: Chinese
word segmentation for micro-blog texts. In Proceed-
ings of The Fifth Conference on Natural Language
Processing and Chinese Computing & The Twenty
Fourth International Conference on Computer Pro-
cessing of Oriental Languages.
Xipeng Qiu, TianXiang Sun, Yige Xu, Yunfan Shao,
Ning Dai, and Xuanjing Huang. 2020. Pre-trained
models for natural language processing: A survey.
SCIENCE CHINA Technological Sciences,
63(10):1872–1897.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in neural information pro-
cessing systems, pages 5998–6008.
William Yang Wang, Lingpeng Kong, Kathryn
Mazaitis, and William Cohen. 2014. Dependency
parsing for weibo: An efficient probabilistic logic
programming approach. In Proceedings of the 2014
conference on empirical methods in natural lan-
guage processing (EMNLP), pages 1152–1158.
Pak-kwong Wong and Chorkin Chan. 1996. Chinese
word segmentation based on maximum matching
and word binding force. In Proceedings of the 16th
Conference on Computational Linguistics - Volume
1, COLING ’96, page 200–203, USA. Association
for Computational Linguistics.
Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei,
and Ming Zhou. 2020. Bert-of-theseus: Compress-
ing bert by progressive module replacing. arXiv
preprint arXiv:2002.02925.
Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Marta
Palmer. 2005. The penn chinese treebank: Phrase
structure annotation of a large corpus. Natural lan-
guage engineering, 11(2):207.
Hang Yan, Xipeng Qiu, and Xuanjing Huang. 2020. A
graph-based model for joint chinese word segmen-
tation and dependency parsing. Transactions of the
Association for Computational Linguistics, 8:78–92.
Meishan Zhang, Yue Zhang, Wanxiang Che, and Ting
Liu. 2014. Type-supervised domain adaptation for
joint segmentation and pos-tagging. In Proceed-
ings of the 14th Conference of the European Chap-
ter of the Association for Computational Linguistics,
pages 588–597.
Yu Zhang, Zhenghua Li, Houquan Zhou, and Min
Zhang. 2020. Is pos tagging necessary or even help-
ful for neural dependency parsing? arXiv preprint
arXiv:2003.03204.
... The transcripts generated by Tencent Cloud were manually checked and revised by research personnel who were native Cantonese speakers. For word segmentation, we employed a deep learning-based Chinese word segmentation engine fastHan [22]. After segmentation, words were translated into linguistic and psychologically meaningful categories using the Chinese version of the Language Inquiry and Word Count (LIWC) dictionary [23]. ...
Article
Full-text available
There is an emerging potential for digital assessment of depression. In this study, Chinese patients with major depressive disorder (MDD) and controls underwent a week of multimodal measurement including actigraphy and app-based measures (D-MOMO) to record rest-activity, facial expression, voice, and mood states. Seven machine-learning models (Random Forest [RF], Logistic regression [LR], Support vector machine [SVM], K-Nearest Neighbors [KNN], Decision tree [DT], Naive Bayes [NB], and Artificial Neural Networks [ANN]) with leave-one-out cross-validation were applied to detect lifetime diagnosis of MDD and non-remission status. Eighty MDD subjects and 76 age- and sex-matched controls completed the actigraphy, while 61 MDD subjects and 47 controls completed the app-based assessment. MDD subjects had lower mobile time (P = 0.006), later sleep midpoint (P = 0.047) and Acrophase (P = 0.024) than controls. For app measurement, MDD subjects had more frequent brow lowering (P = 0.023), less lip corner pulling (P = 0.007), higher pause variability (P = 0.046), more frequent self-reference (P = 0.024) and negative emotion words (P = 0.002), lower articulation rate (P < 0.001) and happiness level (P < 0.001) than controls. With the fusion of all digital modalities, the predictive performance (F1-score) of ANN for a lifetime diagnosis of MDD was 0.81 and 0.70 for non-remission status when combined with the HADS-D item score, respectively. Multimodal digital measurement is a feasible diagnostic tool for depression in Chinese. A combination of multimodal measurement and machine-learning approach has enhanced the performance of digital markers in phenotyping and diagnosis of MDD.
... As interword spacing is absent in Chinese texts (eg, "I want to kill myself" in Chinese would become "Iwanttokillmyself"), Chinese word segmentation was needed to separate words. For Chinese word segmentation, the study used a deep learning-based Chinese word segmentation engine, fastHan, which included local text samples for training and testing its segmentation model, achieving over 90% agreement with human segmentation [31]. ...
Article
Full-text available
Background: Assessing patients’ suicide risk is challenging, especially among those who deny suicidal ideation. Primary care providers have poor agreement in screening suicide risk. Patients’ speech may provide more objective, language-based clues about their underlying suicidal ideation. Text analysis to detect suicide risk in depression is lacking in the literature. Objective: This study aimed to determine whether suicidal ideation can be detected via language features in clinical interviews for depression using natural language processing (NLP) and machine learning (ML). Methods: This cross-sectional study recruited 305 participants between October 2020 and May 2022 (mean age 53.0, SD 11.77 years; female: n=176, 57%), of which 197 had lifetime depression and 108 were healthy. This study was part of ongoing research on characterizing depression with a case-control design. In this study, 236 participants were nonsuicidal, while 56 and 13 had low and high suicide risks, respectively. The structured interview guide for the Hamilton Depression Rating Scale (HAMD) was adopted to assess suicide risk and depression severity. Suicide risk was clinician rated based on a suicide-related question (H11). The interviews were transcribed and the words in participants’ verbal responses were translated into psychologically meaningful categories using Linguistic Inquiry and Word Count (LIWC). Results: Ordinal logistic regression revealed significant suicide-related language features in participants’ responses to the HAMD questions. Increased use of anger words when talking about work and activities posed the highest suicide risk (odds ratio [OR] 2.91, 95% CI 1.22-8.55; P=.02). Random forest models demonstrated that text analysis of the direct responses to H11 was effective in identifying individuals with high suicide risk (AUC 0.76-0.89; P<.001) and detecting suicide risk in general, including both low and high suicide risk (AUC 0.83-0.92; P<.001). More importantly, suicide risk can be detected with satisfactory performance even without patients’ disclosure of suicidal ideation. Based on the response to the question on hypochondriasis, ML models were trained to identify individuals with high suicide risk (AUC 0.76; P<.001). Conclusions: This study examined the perspective of using NLP and ML to analyze the texts from clinical interviews for suicidality detection, which has the potential to provide more accurate and specific markers for suicidal ideation detection. The findings may pave the way for developing high-performance assessment of suicide risk for automated detection, including online chatbot-based interviews for universal screening.
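The citing excerpt for this entry reports over 90% agreement between automatic and human word segmentation but does not say how agreement was measured. As an illustration only, the sketch below uses word-level F1 over character spans, a common way to score Chinese word segmentation. The function names and the example sentence are hypothetical and do not come from the cited study.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(predicted, gold):
    """Word-level F1 between a predicted and a gold segmentation.

    Assumption: a predicted word counts as correct only if both of its
    boundaries match a gold word exactly.
    """
    if "".join(predicted) != "".join(gold):
        raise ValueError("both segmentations must cover the same text")
    pred_spans, gold_spans = word_spans(predicted), word_spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the gold segmentation keeps "回家" as one word.
gold = ["我", "想", "回家"]
predicted = ["我", "想", "回", "家"]
f1 = segmentation_f1(predicted, gold)
```

Here two of the four predicted words match gold words, so precision is 2/4 and recall is 2/3.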
Chapter
This study aims to prevent other types of mental health problems from being misclassified as depression and to remedy the shortage of resources for mental health consultations. It first analyzes the different types of causes of mental health problems, providing an important basis for better understanding the diversity and complexity of this field. Subsequently, a machine learning approach is used to predict the potential causes of different types of mental health problems. This research provides new perspectives and methods for early identification and personalized treatment of mental health problems. The experimental results show that depression accounts for only 16.9% of mental health problems. In predicting the causes of mental health problems, the SVM method performed best, outperforming 5 machine learning methods and 3 deep learning methods. Through these studies, we hope to prevent other types of mental health problems from being misclassified as depression and to remedy the lack of resources for mental health counseling. This will help increase the success rate of early intervention and provide better mental health support for patients.
Chapter
Biomedical causal relation extraction is an important task. It aims to analyze biomedical texts and extract structured information such as named entities, semantic relations, and function types. In recent years, related works have substantially improved the performance of biomedical causal relation extraction. However, they focus only on contextual information and ignore external knowledge. In view of this, we introduce entity information from an external knowledge base as a prompt to enrich the input text, and propose JNT_KB, a causal relation extraction framework that incorporates entity information to support the underlying understanding needed for causal relation extraction. Experimental results show that JNT_KB consistently outperforms state-of-the-art extraction models, reaching a final Stage 2 F1 score of 61.0%.
Article
In this article, we introduce the Chinese Children's Lexicon of Oral Words (CCLOOW), the first lexical database based on animated movies and TV series for 3-to-9-year-old Chinese children. The database is computed from 2.7 million character tokens and 1.8 million word tokens. It contains 3920 unique characters and 22,229 word types. CCLOOW reports frequency and contextual diversity metrics for the characters and words, as well as the length and syntactic categories of the words. CCLOOW frequency and contextual diversity measures correlated well with those of other Chinese lexical databases, particularly those computed from children's books. The predictive validity of CCLOOW measures was confirmed with Grade 2 children's naming and lexical decision experiments. Further, we found that CCLOOW frequencies could explain a considerable proportion of adults' written word recognition, indicating that early language experience might have lasting impacts on the mature lexicon. CCLOOW provides validated frequency and contextual diversity estimates that complement current children's lexical databases based on written language samples. It is freely accessible online at https://www.learn2read.cn/ccloow .
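The two metrics named in this abstract can be sketched in a few lines. The abstract does not define them, so the sketch assumes the usual definitions: frequency is the total number of occurrences of an item, and contextual diversity is the number of distinct contexts (for example, films or episodes) in which it appears. The function name and toy data are hypothetical.

```python
from collections import Counter


def frequency_and_diversity(documents):
    """Return (frequency, contextual_diversity) counters.

    documents: a list of token lists, one list per context.
    Assumption: one context is one film or episode.
    """
    frequency = Counter()
    diversity = Counter()
    for tokens in documents:
        frequency.update(tokens)       # count every occurrence
        diversity.update(set(tokens))  # count each context at most once
    return frequency, diversity


# Three toy "episodes" of already segmented words.
episodes = [["猫", "跑", "猫"], ["狗", "跑"], ["猫", "睡"]]
freq, cd = frequency_and_diversity(episodes)
```

In this toy data "猫" occurs three times but in only two episodes, which is why the two measures can diverge.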
Article
Named entity recognition (NER) plays a crucial role in many downstream natural language processing (NLP) tasks. Chinese NER is challenging because of certain features of Chinese. Recently, large-scale pre-trained language models have been used in Chinese NER. However, since some of these models do not use word information, or employ word information of only a single granularity, the semantic information in sentences cannot be fully captured, which hurts their performance. To take full advantage of word information and obtain richer semantic information, we propose a multi-granularity word fusion method for Chinese NER. We introduce multi-granularity word information into our model. To make full use of the information, we classify it into three kinds: strong information, moderate information, and weak information. These kinds of information are encoded by encoders and then integrated with each other through the strong-weak feedback attention mechanism. Specifically, we apply two separate attention networks to word embeddings and n-gram embeddings. Then, the outputs are fused in another attention network. In all three attention networks, character embeddings serve as the attention queries. We call the results the multi-granularity word information. To combine character information and multi-granularity word information, we introduce two fusion strategies for better performance. The process lets our model obtain rich semantic information and explicitly reduces word segmentation errors and noise. We design experiments that compare several components to find our model's best configuration. An ablation study verifies the effectiveness of each module. The final experiments are conducted on four Chinese NER benchmark datasets, and the F1 scores are 81.51% for Ontonotes4.0, 95.47% for MSRA, 95.87% for Resume, and 69.41% for Weibo. The best improvement achieved by the proposed method is 1.37%. Experimental results show that our method outperforms most baselines and achieves state-of-the-art performance.
Article
Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is intended to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.
Chapter
In the pre-deep-learning era, part-of-speech tags were considered indispensable ingredients for feature engineering in dependency parsing, and quite a few works focused on joint tagging and parsing models to avoid error propagation. In contrast, recent studies suggest that POS tagging becomes much less important or even useless for neural parsing, especially when using character-based word representations. Yet there are not enough investigations focusing on this issue, either empirically or linguistically. To address this, we design and compare three typical multi-task learning frameworks, i.e., Share-Loose, Share-Tight, and Stack, for joint tagging and parsing based on the state-of-the-art biaffine parser. Considering that it is much cheaper to annotate POS tags than parse trees, we also investigate the utilization of large-scale heterogeneous POS tag data. We conduct experiments on both English and Chinese datasets, and the results clearly show that POS tagging (both homogeneous and heterogeneous) can still significantly improve parsing performance when using the Stack joint framework. We conduct detailed analysis and gain more insights from the linguistic aspect.
Article
Chinese word segmentation and dependency parsing are two fundamental tasks for Chinese natural language processing. Dependency parsing is defined at the word level, so word segmentation is a precondition of dependency parsing. As a result, dependency parsing suffers from error propagation and cannot directly make use of character-level pre-trained language models (such as BERT). In this paper, we propose a graph-based model to integrate Chinese word segmentation and dependency parsing. Unlike previous transition-based joint models, our proposed model is more concise, which results in less feature-engineering effort. Our graph-based joint model achieves better performance than previous joint models and state-of-the-art results in both Chinese word segmentation and dependency parsing. Additionally, when combined with BERT, our model can substantially reduce the performance gap in dependency parsing between joint models and gold-segmented word-based models. Our code is publicly available at https://github.com/fastnlp/JointCwsParser
Conference Paper
Recently, neural network models for natural language processing tasks have received increasing attention for their ability to alleviate the burden of manual feature engineering. However, previous neural models cannot extract complicated feature compositions as traditional methods with discrete features do. In this work, we propose a feature-enriched neural model for the joint Chinese word segmentation and part-of-speech tagging task. Specifically, to simulate the feature templates of traditional discrete-feature-based models, we use different filters to model the complex compositional features with convolutional and pooling layers, and then utilize long-distance dependency information with a recurrent layer. Experimental results on five different datasets show the effectiveness of our proposed model.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
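The attention mechanism at the core of the Transformer is scaled dot-product attention, softmax(QK^T / sqrt(d_k))V. The abstract does not spell this out, so the following is only a minimal, unbatched, single-head sketch in plain Python. The function names are illustrative, and masking, multiple heads, and learned projections are all omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# A zero query scores every key equally, so the output is the mean value.
result = scaled_dot_product_attention(
    queries=[[0.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
)
```

Because every query attends to all keys independently of the others, the computation over queries can be parallelized, which is the property the abstract contrasts with recurrent models.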
Conference Paper
In this paper, we give an overview of the shared task at the 5th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2016): Chinese word segmentation for micro-blog texts. Unlike the widely used newswire datasets, the dataset of this shared task consists of relatively informal micro-texts. We also use a new psychometric-inspired evaluation metric for Chinese word segmentation, which aims to balance the highly skewed word distribution across different levels of difficulty. The data and evaluation code can be downloaded from https://github.com/FudanNLP/NLPCC-WordSeg-Weibo.