Screening patents of ICT in construction using deep learning and NLP techniques
Hengqin Wu
Department of Building and Real Estate, The Hong Kong Polytechnic University,
Kowloon, Hong Kong and
School of Management, Harbin Institute of Technology, Harbin, China
Geoffrey Shen
Department of Building and Real Estate, The Hong Kong Polytechnic University,
Kowloon, Hong Kong
Xue Lin
School of Government, Nanjing University, Nanjing, China
Minglei Li
Huawei Technologies Co Ltd, Shenzhen, China
Boyu Zhang
Department of Building and Real Estate, The Hong Kong Polytechnic University,
Kowloon, Hong Kong and
Department of Standards and Codes, China Academy of Building Research,
Beijing, China, and
Clyde Zhengdao Li
College of Civil and Transportation Engineering, Shenzhen University,
Shenzhen, China
Abstract
Purpose: This study proposes an approach to solve the fundamental problem in using query-based methods
(i.e. searching engines and patent retrieval tools) to screen patents of information and communication
technology in construction (ICTC). The fundamental problem is that ICTC incorporates various techniques and
thus cannot be simply represented by man-made queries. To investigate this concern, this study develops a
binary classifier by utilizing deep learning and NLP techniques to automatically identify whether a patent is
relevant to ICTC, thus accurately screening a corpus of ICTC patents.
Design/methodology/approach: This study employs NLP techniques to convert the textual data of
patents into numerical vectors. Then, a supervised deep learning model is developed to learn the relations
between the input vectors and outputs.
Findings: The validation results indicate that (1) the proposed approach has a better performance in screening ICTC patents than traditional machine learning methods; (2) besides the United States Patent and Trademark Office (USPTO), which provides structured and well-written patents, the approach can also accurately screen patents from the Derwent Innovations Index (DIX), in which patents are written in different genres.
Practical implications: This study contributes a specific collection of ICTC patents, which is not provided by the patent offices.
Social implications: The proposed approach contributes an alternative manner of gathering a corpus of patents for domains like ICTC that neither exist as a searchable classification in patent offices, nor are accurately represented by man-made queries.
Originality/value: A deep learning model with two layers of neurons is developed to learn the non-linear relations between the input features and outputs, providing better performance than traditional machine learning models. This study uses advanced NLP techniques, lemmatization and part-of-speech (POS) tagging, to process the textual data of ICTC patents. This study contributes a specific collection of ICTC patents, which is not provided by the patent offices.

We are grateful for the extremely helpful feedback we received from Xiao Li, Juan Huang, Bingxia Sun, and other members of the Sustainable Construction Lab, The Hong Kong Polytechnic University, Hong Kong.

Funding: This research was supported by the National Natural Science Foundation of China (NSFC) (No. 71771067, No. 71801159), the National Natural Science Foundation of Guangdong Province (No. 2018A030310534), and the Youth Fund of Humanities and Social Sciences Research of the Ministry of Education (No. 18YJCZH090).

The current issue and full text archive of this journal is available on Emerald Insight at: https://www.emerald.com/insight/0969-9988.htm

Received 17 September 2019
Revised 3 January 2020; 6 February 2020; 11 February 2020
Accepted 12 February 2020

Engineering, Construction and Architectural Management
© Emerald Publishing Limited
0969-9988
DOI 10.1108/ECAM-09-2019-0480
Keywords ICT in construction, NLP, Deep learning, Information management
Paper type Research paper
1. Introduction
1.1 Research background
Information and communication technology (ICT) has been recognized as a key determinant to
improve the level of coordination and collaboration in the architectural, engineering and
construction (AEC) industry (Davies and Harty, 2013;Wu et al., 2017). Yet, compared with other
industries, the overall adoption rate of ICT in the AEC industry is low (Ahuja et al., 2009), and only a few regular and conventional ICTs, such as 2D drawings, are widely adopted. Regardless of the widely recognized benefits, most advanced and novel ICT applications, such as GPS, 4D modeling, BIM and mobile devices, are still only incidentally employed in the industry (Ahuja et al., 2010; Dehlin and Olofsson, 2008; Frits, 2007; Li et al., 2019). One of the major barriers is that construction practitioners often lack technological knowledge about information and communication technology in construction (ICTC; Adriaanse et al., 2010; Sardroud, 2015).
Up to 80% of technological information is exclusively provided by patents, which are recognized as one of the most valuable resources for technical analysis (Chiarello et al., 2018; Hoetker and Agarwal, 2007; Terragno, 1979). The content archived in a patent document normally expresses scientific and technological information about the technology application in terms of the main machines and approaches involved, the basic functions of the application, the process whereby the application is implemented and the solutions to problems (Intarakumnerd and Charoenporn, 2015).
2015). Therefore, a corpus of patents that widely covers the inventions of ICTC is a valuable
database, not only providing a dictionary for accessing ICTC, but also identifying problems to be solved by state-of-the-art ICTC inventions and recognizing all possible specific embodiments of ICTC (El-Ghandour and Al-Hussein, 2004).
However, such a corpus of ICTC patents does not exist. Table 1 provides the existing
patent classes for the AEC industry in the three major patent offices: the World Intellectual Property Organization (WIPO), the European Patent Office (EPO), and the United States Patent and Trademark Office (USPTO). Table 1 shows that none of the patent offices provides a searchable classification for ICTC. Two offices, WIPO and USPTO, provide a specific category of patents relevant to the AEC industry, namely E and D25, respectively.
The two classes focus on inventions about building materials and fixed construction rather
than information and communication technologies.
1.2 The problem of retrieving ICTC patents by using query-based methods
In the absence of a searchable classification for ICTC in the patent offices, query-based methods
(including the searching engines and other patent retrieval methods) became a possible way for
users to retrieve the patents. These query-based methods aim to retrieve all documents that are
relevant to a given patent application according to a query. Hence, the accuracy and coverage of
retrieval results highly depend on the query (Zhang et al., 2018), which can be formed by a variety
of items, such as keywords, citations, authors, granted year, application date or combinations of
them. The core technique that lies in the query-based methods is query reformulation
converting the input query into new and more searchable queries (Alberts et al.,2017;
Shalaby and Zadrozny, 2019). The query reformulation, including query reduction techniques
(Bouadjenek et al.,2015;Mahdabi et al., 2011), query expanding method (mainly by external
ECAM
dictionary and corpus or ontologies) (Azad and Deepak, 2019;Enesi et al., 2018;Tannebaum and
Rauber, 2014), semantic-based methods (Girthana and Swamynathan, 2018), metadata-based
methods (citations and classification) (Azad and Deepak, 2019;Giachanou et al., 2015;Mahdabi
and Crestani, 2014) and interactive methods (Shalaby and Zadrozny, 2018), enriched the query-
based methods and have obtained performance improvement in recent studies.
However, gathering the corpus of ICTC patents with query-based methods may not achieve good performance, because it is extremely challenging to accurately represent and widely cover the ICTC patents with man-made queries. The patent retrieval tasks, including prior-art search, patentability search and infringement search (Zhang et al., 2018), aim to return a wide coverage of patent documents that are relevant to a patent application according to a query, helping potential patentees check and analyze relevant information before the patent application is granted. Therefore, the queries are frequently used to represent a specific patent application rather than a set of patents. ICTC, standing for the set of ICT applications invented with major embodiments in the AEC industry (Ahuja et al., 2009; Alsafouri and Ayer, 2018), incorporates a number of technologies that may differ greatly from each other (for example, both BIM and RFID are important ICT applications in the AEC industry, but they are totally different technologies). Therefore, completely representing all the ICTC patents with a man-made query makes it a tough task to return accurate results. Moreover, using a query combining a number of items to represent all the ICTC patents increases the number of irrelevant instances returned due to polysemy (the same spelling may have two or more different meanings).
This study performs two trials of retrieving ICTC patents from the USPTO website (USPTO, 2007), based on queries combining a number of items. Table 2 shows the results of this query-based method; 50 patents were randomly selected to manually check the accuracy, i.e. the proportion of ICTC patents among all the retrieved patents. Even though a complicated combination of items was used to search for ICTC patents, the accuracy is low.
Table 1. Existing classification schemes in the three major patent offices

International Patent Classification (IPC), World Intellectual Property Organization (WIPO):
  E: Fixed constructions
    E01 Construction of roads, railways or bridges
    E02 Hydraulic engineering; foundations; soil-shifting
    E03 Water supply; sewerage
    E04 Building
    E05 Locks; keys; window or door fittings; safes
    E06 Doors, windows, shutters, or roller blinds, in general; ladders
    E21 Earth or rock drilling; mining
    E99 Subject matter not otherwise provided for in this section

Cooperative Patent Classification (CPC), European Patent Office (EPO):
  None

United States Patent Classification (USPC), USPTO:
  D25: Building units and construction elements
    1 Structure
    2 Prefabricated unit
    3 Stair, ladder, scaffold or similar support
    4 Trellis or treillage unit
    5 Architectural stock material
Moreover, most of the potential users in construction practice are non-experts, who may not be able to perform such a complex and time-consuming searching task (Liu et al., 2011).
1.3 Research objectives
Given the aforementioned constraints of existing query-based methods, this study develops a
binary classifier to automatically identify whether a patent is relevant to ICTC, and thus
accurately screening a corpus of ICTC patents from the primarily searched results containing a
number of irrelevant patents. Therefore, this study treats the task of screening ICTC patents as a
classification task rather than a retrieval task. A large number of studies have investigated
patent classification, and most of them emphasized the use of traditional machine learning (e.g. SVM and Bayes) and text-mining techniques (e.g. N-gram and stop-words removal) (Li et al., 2012;
Wu et al., 2010). Alternatively, this study resorts to the techniques from the realm of NLP and
deep learning. On the one hand, NLP techniques provide a smart way to process textual data
(Kurdi, 2017), saving time and avoiding personal bias in analysis processes (Agrawal and
Henderson, 2002;Bell et al., 2009;Cassetta et al., 2017;Choi et al., 2012;Gwak and Sohn, 2018),
especially when the volume of a text is large (Shekarpour et al., 2015;Silva et al.,2016). On the
other hand, a deep learning model, multi-layer perceptron (MLP), is developed to learn the
relations between the input features and outputs. Deep learning is a state-of-the-art approach that has achieved significant performance improvements in NLP tasks. Compared with the algorithms and statistics of machine learning models, deep learning models are organized
by multiple layers of neural networks. Each layer consists of neurons, receiving signals from the
former layer and passing converted signals by activation functions to the subsequent layer
(Riedmiller, 1994). With the multiple layers of neural networks, the whole deep learning model
can address highly non-linear associations between the representations and the outputs (Wang
et al.,2016), whereas the machine learning algorithms can only examine linear relations.
Table 2. Searching results by using the search engine in USPTO

Strategy 1
  Query items: CPC classification class and topic (matching input keywords within patent titles, abstracts and descriptions)
  CPC classification class: ICT-related classes, including H04 (electric communication technique); G06 (computing, calculating or counting); H01P (waveguides, resonators, lines, or other devices of the waveguide type); H01Q (antennas, i.e. radio aerials); G01S (radio direction-finding, radio navigation, determining distance or velocity by use of radio waves, locating or presence-detecting by use of the reflection or re-radiation of radio waves or analogous arrangements using other waves); G08B (signaling or calling systems, order telegraphs or alarm systems); G08C (transmission systems for measured values, control or similar signals); G11B (information storage based on relative movement between record carrier and transducer)
  Keywords: AEC domain terms, including construction project, project management, infrastructure project, civil engineering and transportation project (Flyvbjerg, 2014; Greiman, 2013; Levitt, 2007; Mok et al., 2015; Zidane et al., 2013)
  Matched results: Collection 1, 5,311 patents
  Accuracy: 8%

Strategy 2
  Query item: Topic
  Keywords: ICT-related terms, including radio frequency identification (RFID), 3D laser scanning, quick response, NFC, augmented reality (AR), mobile computing, wireless connection (Wi-Fi) and robotics (drones) (Ahuja et al., 2009; Alsafouri and Ayer, 2018; Li et al., 2016), and AEC domain terms
  Matched results: Collection 2, 922 patents
  Accuracy: 12%
2. Related work
Several attempts have been made to establish classifiers for automatic patent classification
(Chakrabarti et al., 1998;Smith, 2002). Most of the studies, at the beginning, extracted the
features from the structured data or metadata, such as keywords and citations (Michel and
Bettels, 2001;Perez-Molina, 2018). In the recent decade, scholars prefer using unstructured
data (Cambria and White, 2014;Collobert et al., 2011;Gimpel et al., 2011). These studies
typically have three key steps: processing the textual data, vectorizing the patents and using
machine learning methods to train the models. Focusing on the three key steps, this section
describes a synopsis of the relevant literature.
2.1 NLP techniques for processing textual data
A large volume of unformatted text exists in the current Web 2.0 era (Ittoo et al., 2016). This chunk of information is mainly unstructured and thus cannot be processed in the machine-tractable ways that structured data can be (Cambria and White, 2014). Processing the unstructured data is regarded as one of the most time-consuming steps of text classification tasks (Munková et al., 2013). The major objective of the processing is to clean and format the raw textual data, which can largely eliminate noisy features for further vectorization (Haddi et al.,
2013). Many NLP tools have been introduced, such as stop and common words removal,
tokenization, lemmatization and stemming (Aggarwal and Reddy, 2013). Two typical tools
are part-of-speech (POS) and named entity recognition, which can recognize and process
syntax information (grammatical meanings) (Collobert et al., 2011;Gimpel et al., 2011) and
named entities, respectively (Nadeau and Sekine, 2007).
2.2 Vectorizing methods
With regard to the vectorization, a number of algorithms have been developed to convert the
textual data into vectors. Bag-of-words (BOW), topic models and subject-action-object (SAO) have been used in recent patent classification studies (Li et al., 2018; Venugopalan and
Rai, 2015). Traditional BOW models typically construct the feature space vectors in which
each position is occupied by a term or a phrase (Forman, 2002). Its measurements include
N-grams, bi-grams and word frequency to identify phrases from the texts (Onan et al., 2016),
depending on how the phrases are counted. Although BOW models are simple and may generate a large number of features, they remain the most effective feature selection method (Mirończuk and Protasiewicz, 2018; Onan et al., 2016). Topic models and SAO were mainly developed to solve the high-dimension problem, replacing the BOW features with latent topics (Kaplan and Vakili, 2015) or SAO structures (Gerken, 2012).
2.3 Supervised learning models for patent classification
To date, most of the patent classifiers are trained by machine learning models, such as SVM,
Naive Bayes and k-nearest-neighbor. The accuracy was relatively low in earlier studies (around
70%) (Saiki et al., 2006), but an increasing trend has been observed when feature selection
models and NLP techniques were used to extract useful features from unstructured texts,
achieving accuracy around 85% (Venugopalan and Rai, 2015;Wu et al.,2010). In the recent
decade, the widespread use of deep learning models has led to notable success. Deep learning
models have been developed and adopted in a variety of fields, such as natural language
understanding, video and image recognition and the game of Go (Al Rahhal et al., 2018; Cocarascu and Toni, 2018; Silver et al., 2016). Those deep learning models were always developed with complex and elaborate architectures in which multiple layers of neural networks were well structured. However, only the artificial neural network (ANN) with one layer of neurons has been applied to patent classification, and the accuracy was relatively low, at 75% (Li et al., 2012).
In addition, most of the research has sought to classify patents into pre-defined classes
that already existed in the classification schemes in patent offices, such as International
Patent Classification (IPC) and European classification code hierarchy (Fall et al., 2003a,b).
Such expositions provide automatic and efficient methods for inventors and examiners to
label the new patents with existing classes, but do not provide opportunities to advance the
understanding of the real world in the target field. Although using deep learning models may lead to better performance than traditional machine learning models, few studies have employed deep learning models in automatic patent classification tasks.
3. The proposed approach integrating deep learning and NLP techniques
To screen ICTC patents from a patent collection, a binary classifier is developed to classify
pieces of patents into two classes: ICTC-related or not. Figure 1 shows the overall procedure of
the approach to achieve the classifier. The first step is to collect a database for training, incorporating the full texts of the instances annotated with target labels (the two classes). Then, NLP tools are used to process the textual data into clean texts. Based on the processed texts, N-gram and tf-idf algorithms are employed for vectorization, representing each patent as a numerical vector that can be fed into the MLP, which is trained by gradient descent with tuned hyperparameters. Finally, a validation experiment is conducted by means of k-fold cross-validation on two data sets. The succeeding sections discuss these steps.
3.1 Data collection and annotation
The target of this step is to obtain training data: the full texts of the patents that are manually labeled as ICTC or non-ICTC. All the required patent texts were crawled from USPTO, because (1) USPTO is the largest international patent grant office, and (2) USPTO is recognized as the most representative database for analyzing technological knowledge, providing patents that are well written and structured according to its requirements (Wang, 2018). The authors retrieved the patents on July 30, 2018. In total, 348 patents were collected and annotated as the data set for further training and testing. The detailed processes are described in the following two paragraphs.

Figure 1. Overall procedure of the approach
Figure 2 depicts the data collection and annotation process, whereby patents were
collected and annotated as ICTC or non-ICTC. As for the ICTC class, patents were gathered in
the following steps: (1) by querying search strategy 1 in Table 2 (ICT classes and AEC domain
terms), 5,311 patents were obtained in collection 1, (2) totally 1,500 patents were randomly
selected from the 5,311 patents and (3) through the process of manually checking [1], 174
patents were obtained as ICTC class from the 1,500 patents.
As for the non-ICTC class, the patents were collected from two different sources. One was
through the annotation process mentioned above, in which 1,326 patents were identified as
non-ICTC class. The other was obtained from the patents retrieved by searching AEC domain
terms and excluding patents in collection 1. This results in a combined collection of 1,576 non-
ICTC patents, and 174 of them were randomly selected as training instances for the non-ICTC
class. This complex collection process has two advantages. (1) The non-ICTC class contains not only common ICTs that are not ICTC patents, but also technologies of the AEC industry that are not ICTC patents. This can prevent overfitting, thus generating a more generalized model that is able to distinguish ICTC patents from ICT patents as well as from AEC patents. (2) This study uses negative sampling to make the two classes the same size, because a balanced size for each class is proven to be a key factor in achieving high accuracy in training (Brown and Mues, 2012; Zhao et al., 2015).
3.2 Data processing by NLP techniques
The raw text of each patent contains several sections (e.g. code, title, abstract, CPC classes, inventors, countries and description). Among them, the title, abstract and claims are frequently utilized and were retained for further analysis in this study because they are recognized as useful items providing basic technological information (Niemann et al., 2017; Venugopalan and Rai, 2015). The title and abstract convey the essence of the technology and are typically written in a restricted pattern with short content (Lee et al., 2013). In addition, the claims define the protection rights of the invention, providing articulated expressions of the technical boundaries and specifications (Niemann et al., 2017).
The selected text of patents is raw data, which is pre-processed by NLP techniques for
further analysis. Without pre-processing, the texts would contain a lot of noisy features (in a
typical case, the number of features can be close to the number of words in the dictionary of
the training instances), thus creating higher-dimension vectors. To process the selected raw
text, this study employs three NLP techniques (Figure 3 plots the pre-processing procedure using these techniques).

Figure 2. The process of data collection

(1) Tokenization. For each raw sentence in the texts, tokenization is utilized to split the sentence into words. In addition, all the words are converted to lowercase, and punctuation is removed. Through this step, all the raw sentences are replaced with sequences of lowercase words. (2) POS tagging. In this step, each word is tagged with a POS tag indicating its syntactic role (e.g. noun, adverb) according to the surrounding words. POS tagging plays a central role in text processing and can increase the accuracy of lemmatization and stemming (Habash et al., 2009). (3) Lemmatization and stop-words removal. The purpose of this step is to correctly match words with different forms, such as plural forms for nouns and present and past forms for verbs. Lemmatization transforms the different forms into stem forms (root words). However, lemmatization may generate a number of mistakes without POS tagging. For example, "modeling" may be a present participle of a verb (with lemma "model") or a noun (with lemma "modeling") according to the context, and the lemma of the noun "modeling" would be wrongly identified as "model" without the POS information (Vlachidis and Tudhope, 2016). This study utilizes NLTK toolkits to perform POS tagging and lemmatization (Bird and Loper, 2004). Moreover, stop-words (e.g. a, an, of, one, two, three and so on) are removed, because they are non-descriptive and do not convey semantic meanings.
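As a rough illustration of the three pre-processing steps, the following is a simplified pure-Python stand-in; the study itself uses NLTK's tokenizer, POS tagger and lemmatizer, so the stop-word subset and the naive plural-stripping rule here are toy assumptions only.

```python
import re

# A hand-picked subset of English stop-words; NLTK provides the full list.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "are", "for", "on"}

def tokenize(sentence):
    """Step 1: lowercase, strip punctuation, split into word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())

def lemmatize(token):
    """Toy lemmatizer: reduce simple plural nouns to their stem form.
    (The study uses NLTK lemmatization guided by POS tags instead.)"""
    if token.endswith("ies"):
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(sentence):
    """Steps 1-3 combined: tokenize, drop stop-words, then lemmatize."""
    return [lemmatize(t) for t in tokenize(sentence) if t not in STOP_WORDS]

tokens = preprocess("The sensors transmit signals to the BIM models.")
# tokens == ['sensor', 'transmit', 'signal', 'bim', 'model']
```

The toy lemmatizer shows exactly why the paper pairs lemmatization with POS tagging: without syntactic context, suffix rules alone cannot tell a plural noun from a verb form.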
3.3 Vectorizing patents
The processed patent texts have to be converted into numerical vectors that can be fed into
MLP. This study adopts the N-gram model with a tf-idf weighting algorithm to vectorize the patents. N-gram considers N words in a sequence as a feature; it was proposed in the 1940s (Shannon, 1948) and has been employed in a large and growing body of literature (Bengio et al., 2003; Benson and Magee, 2013).

Figure 3. The processing procedure for textual data in patents

In this case, two typical N-gram models, N = 1 (unigram) and N = 2 (bigram), are used to extract unigrams and bigrams as features from the patent texts, constituting the vocabulary with size v (overall, v unigrams and bigrams are identified from the patent texts). A vector with v dimensions, in which each position is the tf-idf (term frequency and inverse document frequency; see Sparck Jones (1972) for details) value of the corresponding feature in the vocabulary, is generated to represent a patent. Another necessary step is to filter useful features, because many of the features do not contribute to the training and prediction. This study, based on the tf-idf vectors, uses the ANOVA F-value to select the top features (number = K). In this study, K is set as a hyperparameter that is tuned in the optimization step.
This study adopts N-gram and tf-idf as the vectorizing approach but not the topic models
or SAO structures, because (1) BOW and tf-idf have been widely used in NLP studies and
have been proven as prominent vectorizing algorithms due to their simplicity and effectiveness (García Adeva et al., 2014; Mirończuk and Protasiewicz, 2018; Pavlinek and Podgorelec, 2017); (2) topic models and SAO structures are suitable for clustering or classification tasks that have more than two classes to be distinguished (Blei et al., 2003; Choi et al., 2012); (3) topic models and SAO structures replace the N-grams with latent topics or subject-action-object structures. This generates vectors with much lower dimensions, which is not necessary in this case, because the proposed MLP model may achieve better performance when the number of input features is large (Cakir and Yilmaz, 2014).
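The unigram/bigram extraction and tf-idf weighting described above can be sketched as follows. This uses the basic idf form log(N/df) from Sparck Jones (1972); production toolkits apply smoothing variants, and the pre-processed token lists below are toy examples, not the study's patent data.

```python
import math
from collections import Counter

def ngrams(tokens, n_values=(1, 2)):
    """Extract unigrams (N = 1) and bigrams (N = 2) from a token list."""
    feats = []
    for n in n_values:
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

def tfidf_vectors(docs):
    """Map each document to {feature: tf-idf} over the shared vocabulary,
    with idf = log(n_docs / document_frequency)."""
    grams = [ngrams(d) for d in docs]
    n_docs = len(docs)
    df = Counter()
    for g in grams:
        df.update(set(g))  # document frequency: count each feature once per doc
    return [{f: tf * math.log(n_docs / df[f]) for f, tf in Counter(g).items()}
            for g in grams]

# Toy pre-processed patents: two ICT-flavored, one construction-flavored
docs = [["rfid", "tag", "reader"], ["rfid", "antenna"], ["concrete", "beam"]]
vecs = tfidf_vectors(docs)
```

Features that appear in every document get idf = 0 and drop out of the representation, which is the intended noise-suppression effect before the ANOVA F-value selection step.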
3.4 MLP architecture
To improve the performance and take the non-linear relations into consideration, this study
proposes MLP to learn and train the complex relations between inputs and outputs. MLP is
typically designed with a feed-forward-based architecture and back-propagation learning
process (Rosenblatt, 1961). There are a number of neurons in the MLP, and each of them
receives signals from the former layer and passes transformed signals by an activation
function to the subsequent layer (Riedmiller, 1994). Although it is conventional wisdom that deep learning models are better than machine learning models, neural network design and hyperparameter choices are more important than the deep learning models themselves (Levy et al., 2015). This section describes the MLP architecture.
After the vectorizing, the input matrix in this study is X ∈ R^{N×F}, in which N and F represent the number of instances and features, respectively. Features are set as columns, and thus each patent is reflected as a row vector x_i ∈ R^{1×F}. The output is a column vector Y ∈ R^{N×1}. The main target of the MLP model is to obtain learned neurons in layers of neural networks that can predict Y from X. Figure 4 illustrates the architecture of the MLP, consisting of four layers: one input layer, two hidden layers and an output layer, labeled from layer 0 to layer 3. The weight matrices connect the layers in sequence, and the neurons in the hidden and output layers are processing units, embodied with activation functions to transform the inputs to outputs. The numbers of neurons in the input and output layers are set as F and 1, respectively, which are subject to the dimensions of the input and output vectors. The numbers of neurons in the hidden layers are set as hyperparameters.
The MLP predicts the outputs based on the connection weights and the activation functions. Specifically, the j-th neuron in the l-th layer produces an output based on the following equations:

h_j^l = f^l\left(\sum_{i=1}^{U^{l-1}} h_i^{l-1} w_{ij}^l + b^l\right), \quad l = 1, 2
h = h^3 = f^3\left(\sum_{i=1}^{U^2} h_i^2 w_i^3 + b^3\right), \quad l = 3    (1)
where l represents the layer sequence, U^{l-1} indicates the number of neurons in the (l-1)-th layer, h_i^{l-1} denotes the output of the i-th neuron in that layer, w_{ij}^l is the weight connecting h_i^{l-1} and the j-th neuron in the l-th layer, b^l is the bias term for this neuron and f^l is the activation function in the l-th layer. In this case, the two hidden layers (layer 1 and layer 2) use the Rectified Linear Unit (ReLU), and the output layer (layer 3) uses the sigmoid function as the activation function. With the back-propagation process, the neurons in the hidden and output layers can be trained with unique weight matrices and biases, producing different outputs according to the tasks (Garcia-Laencina et al., 2013). Moreover, dropout is adopted in the hidden layers to prevent overfitting (Srivastava et al., 2014).
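The feed-forward pass described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the hidden-layer width (U = 13) follows the tuned value reported later, the feature count is shrunk for the demo, and the weight values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, params):
    """Feed-forward pass of the four-layer MLP: input, two ReLU hidden layers, sigmoid output (Eq. 1)."""
    h = x
    for W, b in params[:-1]:          # hidden layers (l = 1, 2), ReLU
        h = relu(h @ W + b)
    W3, b3 = params[-1]               # output layer (l = 3), sigmoid
    return sigmoid(h @ W3 + b3)

F, U = 100, 13                        # feature and hidden-unit counts (F shrunk for the demo)
params = [
    (rng.normal(scale=0.1, size=(F, U)), np.zeros(U)),
    (rng.normal(scale=0.1, size=(U, U)), np.zeros(U)),
    (rng.normal(scale=0.1, size=(U, 1)), np.zeros(1)),
]
X = rng.random((4, F))                # four vectorized patents
probs = mlp_forward(X, params)        # shape (4, 1), values in (0, 1)
```

Each row of `probs` is the predicted probability that the corresponding patent belongs to the ICTC class.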
3.5 Model training by gradients and dropouts
As mentioned above, the main task of the MLP is to learn neurons that can predict Y from X. The learning process is achieved through iterations, each of which is a loop consisting of a feed-forward and a back-propagation process (Haykin, 1999; Riedmiller, 1994). In the feed-forward process, the weights and biases in the hidden and output layers (randomly initialized at the start) are propagated forward, calculating the output value h^3 from the input X. Since a sigmoid function is selected as the activation function in the output layer, the loss between the predictions (with values between 0 and 1) and the true labels (with values of only 0 or 1) takes the logistic (cross-entropy) form. The loss function is the following:
J = − Σ_{n=1}^{N} [ y_n log h_n^3 + (1 − y_n) log(1 − h_n^3) ]        (2)
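Eq. (2) is the binary cross-entropy. A minimal NumPy sketch, with a small epsilon clip added for numerical stability (an implementation detail not stated in the paper):

```python
import numpy as np

def bce_loss(y_true, h3, eps=1e-12):
    """J = -sum_n [ y_n log h3_n + (1 - y_n) log(1 - h3_n) ]  (Eq. 2)."""
    h3 = np.clip(h3, eps, 1.0 - eps)   # keep log() finite at 0 and 1
    return -np.sum(y_true * np.log(h3) + (1.0 - y_true) * np.log(1.0 - h3))

y = np.array([1.0, 0.0, 1.0])          # true labels
p = np.array([0.9, 0.1, 0.8])          # sigmoid outputs h^3
loss = bce_loss(y, p)                  # small, since the predictions match the labels
```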
In the back-propagation, the parameters θ (including all the weights and biases in the hidden and output layers) would be updated by stochastic gradient descent. Two types of signals constitute the gradients: (1) global signals that can be computed from the derivatives, which transform the errors from the loss function and (2) local signals that are the inputs from the former layer. The θ would be updated from back to front, as the gradients are computed from the loss value back through the former layers, one by one. To clarify the back-propagation process, this study illustrates the updating of w_{ij}^2 and w_j^3 in layers 2 and 3.

Figure 4. The MLP architecture

Figure 5 shows the functions of the neurons in layers 2 and 3, respectively. The gradient of w_j^3, denoted ∇w_j^3, is defined as the derivative of J with respect to w_j^3, which could be computed by the chain rule of derivatives:
∇w_j^3 = dJ/dw_j^3 = (dJ/dh^3) × (dh^3/dN^3) × (dN^3/dw_j^3) = f′(N^3) × h_j^2        (3)

w_j^3(new) = f( w_j^3(old), ∇w_j^3 )        (4)
where f′(N^3) is the global signal that could be computed by the derivative with the loss value, h_j^2 is the local signal (the output of the j-th neuron in layer 2) and a is the predefined learning rate.
Similar to layer 3, ∇w_{ij}^2 could be computed by the following:

∇w_{ij}^2 = dJ/dw_{ij}^2 = (dJ/dh^2) × (dh^2/dN^2) × (dN^2/dw_{ij}^2) = f′(N^2) × w_j^3 × f′(N^3) × h_i^1        (5)

w_{ij}^2(new) = f( w_{ij}^2(old), ∇w_{ij}^2 )        (6)
where f′(N^2) × w_j^3 × f′(N^3) is the global signal that is propagated from the loss value, and h_i^1 is the local signal. The computations of the other parameters, such as w and b, are similar to Eqns (3) and (5). According to the gradients, the parameters could be updated by optimization algorithms (Eqns (4) and (6)). Typical optimization algorithms include stochastic gradient descent (Robbins and Monro, 1951), AdaGrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2014). This study applies the Adam algorithm as the optimizer, as it has been recognized as the most effective in most cases, with less computation time.
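Eqns (3)–(4) can be illustrated for the output layer with NumPy. This is a sketch under simplifying assumptions: for a sigmoid output combined with the cross-entropy loss of Eq. (2), the global signal dJ/dN^3 reduces to the well-known identity (h^3 − y), and a plain gradient-descent step stands in for the generic update function f (the study itself uses Adam).

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# h2: layer-2 outputs (the local signals); w3, b3: output-layer parameters
h2 = rng.random((8, 13))                   # 8 instances, U = 13 hidden units
w3 = rng.normal(scale=0.1, size=(13, 1))
b3 = 0.0
y = rng.integers(0, 2, size=(8, 1)).astype(float)

def loss(w3, b3):
    h3 = sigmoid(h2 @ w3 + b3)
    return -np.sum(y * np.log(h3) + (1 - y) * np.log(1 - h3))

# Eq. (3): gradient of w3_j = global signal x local signal h2_j
h3 = sigmoid(h2 @ w3 + b3)
global_signal = h3 - y                     # dJ/dN^3 for sigmoid + cross-entropy
grad_w3 = h2.T @ global_signal             # one gradient entry per w3_j

# Eq. (4): update, here plain gradient descent with learning rate a
a = 0.01
before = loss(w3, b3)
w3_new = w3 - a * grad_w3
after = loss(w3_new, b3)                   # loss decreases after the step
```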
Figure 5. The neurons with input and activation functions in the last two layers

Dropout is applied in the training process. Dropout refers to temporarily eliminating some neurons, along with their incoming and outgoing connections, from the neural network. The dropped neurons are selected randomly based on a predefined ratio a (a = 0.2 in this case). In the back-propagation of a training loop, a new thinned neural network is obtained, with a proportion of (1 − a) of the neurons remaining. The parameter updating process is implemented within the thinned neural network. In the feed-forward process of the subsequent loop, the removed neurons are turned on again, and their parameters are obtained from the remaining neurons by a scale of 1/a. Therefore, training an MLP with dropout can be regarded as training a large number of thinned neural networks that share the same parameters. Such a training fashion effectively prevents neurons from co-adapting, and thus prevents overfitting (Al Rahhal et al., 2018). For the details of dropout, please see Al Rahhal et al. (2018).
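The thinning step can be sketched with a random binary mask. This is a minimal NumPy sketch, assuming the common "inverted dropout" convention, in which the surviving activations are rescaled by 1/(1 − a) during training so that no extra rescaling is needed at test time (the paper instead describes a rescaling at inference; the two conventions serve the same purpose):

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(h, a=0.2, training=True):
    """Randomly zero a fraction `a` of the neuron outputs (inverted-dropout sketch).

    Surviving activations are scaled by 1/(1 - a) during training so their
    expected value is unchanged; at test time the layer is left untouched.
    """
    if not training:
        return h
    mask = rng.random(h.shape) >= a        # keep each neuron with probability (1 - a)
    return h * mask / (1.0 - a)

h = np.ones((1000, 13))                    # layer-2 activations for 1,000 instances
h_thin = dropout(h, a=0.2)                 # roughly 20% of the entries are zeroed
```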
After updating θ, one loop of feed-forward and back-propagation finishes; these loops are iterated during training. In this case, the maximum number of epochs is set as 1,000, and the number of consecutive epochs without a decrease in the loss value is set as two. The training process iterates the loops until either of the above stop conditions is met. A small number of self-developed Python programs are used to build, train, optimize and validate the model.
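The two stop conditions (at most 1,000 epochs, or two consecutive epochs without a loss decrease) can be sketched as a plain Python loop. Here `train_one_epoch` is a hypothetical stand-in that runs one feed-forward/back-propagation loop and returns the epoch's loss:

```python
def train_with_early_stopping(train_one_epoch, max_epochs=1000, patience=2):
    """Iterate training loops until the epoch limit or the patience limit is hit."""
    best_loss = float("inf")
    no_improve = 0                       # consecutive epochs without a loss decrease
    for epoch in range(max_epochs):
        loss = train_one_epoch(epoch)
        if loss < best_loss:
            best_loss, no_improve = loss, 0
        else:
            no_improve += 1
        if no_improve >= patience:       # two tries without a decrease -> stop
            break
    return epoch + 1, best_loss

# Hypothetical loss curve: improves for five epochs, then plateaus.
losses = [1.0, 0.8, 0.6, 0.5, 0.45, 0.45, 0.45, 0.45]
epochs_run, best = train_with_early_stopping(lambda e: losses[e])
# epochs_run == 7: the loop stops after two non-improving epochs
```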
4. Results
4.1 Results of hyperparameters tuning
The purpose of hyperparameter tuning is to achieve an MLP model with the best performance. The range of the number of features F is from 1,000 to 40,000, with steps of 1,000 for F in (1,000, 10,000) and 2,500 for F in (10,000, 40,000). With regard to the number of units, this study adopts the measurement proposed by Fan et al. (2015), which suggests a range around √(N + 1) (N denotes the number of neurons in the input layer). The resulting range of the number of units is from 5 to 69, and the step is set as 8. Figure 6 shows the hyperparameter tuning process. The MLP model reaches the highest accuracy when F is 30,000 and U is 13.
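The tuning grid described above can be sketched as follows; `evaluate` is a hypothetical placeholder for training and cross-validating an MLP with F features and U hidden units, hard-wired here to return the reported optimum so the sketch is runnable:

```python
# Feature counts: step 1,000 over (1,000, 10,000], step 2,500 over (10,000, 40,000].
feature_grid = list(range(1000, 10001, 1000)) + list(range(12500, 40001, 2500))
# Hidden-unit counts: around sqrt(N + 1), from 5 to 69 with a step of 8.
unit_grid = list(range(5, 70, 8))

def evaluate(F, U):
    """Hypothetical placeholder: train the MLP with (F, U) and return its CV accuracy."""
    return 0.955 if (F, U) == (30000, 13) else 0.9

best = max(((F, U) for F in feature_grid for U in unit_grid),
           key=lambda fu: evaluate(*fu))
# best == (30000, 13), matching the reported optimum
```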
4.2 Validation
4.2.1 Validation methods. This study validates the proposed approach not only over the data
set in which 348 patents (labeled as ICTC or non-ICTC) were collected from USPTO, but also
the patents from Derwent Innovations Index (DIX). The additional validation over DIX patents can evaluate the performance of the proposed model in processing texts written in different genres.

Figure 6. The hyperparameters tuning process

Following prior machine learning studies (Sokolova and Lapalme, 2009), this study utilizes precision, recall and F-score to validate the deep learning model based on true positives (TP), false positives (FP) and false negatives (FN). TP is the number of positive instances the model correctly predicted as positive; FP is the number of negative instances the model incorrectly predicted as positive; FN is the number of positive instances the model incorrectly predicted as negative. The precision, recall and F-score can be computed by
P = TP/(TP + FP),   R = TP/(TP + FN),   F1 = (2 × P × R)/(P + R)        (7)
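Eq. (7) in plain Python, checked on hypothetical confusion counts (the counts below are illustrative, not taken from the paper):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute P, R and F1 from true/false positives and false negatives (Eq. 7)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# Hypothetical confusion counts: 60 TP, 4 FP, 6 FN.
p, r, f1 = precision_recall_f1(60, 4, 6)   # p = 0.9375, r ≈ 0.909, f1 ≈ 0.923
```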
Specifically, for the 348 USPTO patents in the data set, we used k-fold cross-validation along with the training process. In this process, all the annotated data are randomly split into k folds of the same size; one fold is set as the test instances and the others are used for training. Such a training process is performed k times, each time with a different fold for testing and a different composition of k−1 folds for training. The final validation value is the average of the k validation values. In this way, k-fold cross-validation prevents bias in data selection and ensures objective measures of performance (Friedman et al., 2001). In this study, k is set as 5, and all the annotated data (the 348 USPTO patents annotated as the ICTC or non-ICTC class obtained in Section 3.1) were randomly divided into 5 folds. For each training round, four folds constitute the training collection and the remaining fold constitutes the testing collection.
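The fold construction can be sketched in plain Python; the 348 indices stand for the annotated USPTO patents, and each round pairs one test fold with the remaining four training folds:

```python
import random

def k_fold_splits(n_instances, k=5, seed=0):
    """Shuffle the instance indices and yield (train, test) index lists for each fold."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)        # prevents bias in data selection
    folds = [idx[i::k] for i in range(k)]   # k folds of (nearly) equal size
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_splits(348, k=5))      # five (train, test) rounds
```

The final performance value is then the mean of the five test results.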
Besides the annotated data, this study collected and randomly selected 200 patents from DIX as an additional testing data set; these patents were written by inventors from a much wider range of countries and agencies. The search strategy is the same as the retrieval from USPTO: topic = "construction project", "project management", "infrastructure project", "civil engineering" and "transportation project". As DIX does not provide the claim and description fields, the title and abstract were collected as the raw data. Before validation, the raw data of the 200 patents had to be annotated, processed and vectorized by the same processes described in Sections 3.1, 3.2 and 3.3.
4.2.2 Validation results. In this validation, the goal is to verify if the MLP has better
screening accuracy than the traditional machine learning models. The performance of the
MLP is compared against existing machine learning models, including Gaussian Naive Bayes
(GNB), SVM and Bernoulli Naive Bayes (BNB). Figure 7 shows the precision over the different feature numbers, with the highest precision value for each model marked above the lines.

Figure 7. The precision values for MLP and machine learning models over the features

By examining the figure, we can verify that the MLP model is superior to those machine learning models over all the features except K = 1,000 and K = 40,000. We can also observe that the MLP model is more sensitive to the number of features, with the highest standard deviation value (0.032) among the models. This is consistent with one of the major differences between deep learning and traditional machine learning models: the traditional machine learning models are not capable of adjusting the model complexity according to the inputs, whereas deep learning models can tune the structure (number of layers and neurons) that is most suitable for the input features (Moraes et al., 2013).
Table 3 illustrates the cross-validation results over the optimized MLP (K = 30,000, U = 13), GNB, SVM and BNB. As mentioned above, fivefold cross-validation is used to verify the performance of the trained model. All the annotated instances were shuffled and randomly divided into five folds of the same size; four of them were used for training and the remaining one for testing. The training and testing were performed five times, each time with a different fold for testing and a different combination of four folds for training. The performance value is the mean of the five test results. It can be observed that the MLP (K = 30,000, U = 13) has the best performance on all three indexes (precision, recall and F1 score).
Table 4 shows the validation results over another database, which is used to evaluate the generality of the proposed model. The results indicate that the learned MLP-based classifier is also highly accurate over the database from DIX, in which the patents were written in varying styles by a variety of inventors from different countries. In addition, the MLP also outperforms the machine learning methods.
4.3 The screened ICTC patents
Besides the validation results, some important implications should be further discussed. To show the differences among the patents in the three collections (the screened corpus and collections 1 and 2; Table 1), this study plots the feature space for each of the collections (Figure 8). All the patents were vectorized according to the processes in Sections 3.1–3.3. The t-Distributed Stochastic Neighbor Embedding (TSNE) algorithm (Czerniawski et al., 2018; Maaten and Hinton, 2008) was adopted to project the high-dimensional feature vectors onto a 2D plot, in which the physical distance between two features roughly represents their degree of association in the corresponding collection.
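A minimal sketch of this projection with scikit-learn's TSNE; the random matrix here merely stands in for the real vectorized patents, and the perplexity value is an illustrative choice:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
X = rng.random((60, 50))                   # 60 stand-in patent vectors, 50 features

# Project the high-dimensional vectors onto 2D; nearby points indicate
# strongly associated features in the collection.
embedding = TSNE(n_components=2, perplexity=15, init="random",
                 random_state=0).fit_transform(X)
# embedding has shape (60, 2) and can be scatter-plotted as in Figure 8
```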
Table 3. Cross-validation results over MLP, GNB, SVM and BNB in the initial data set

Model                       Precision   Recall   F1 score
MLP (K = 30,000, U = 13)    0.955       0.954    0.954
GNB (K = 25,000)            0.925       0.919    0.918
SVM (K = 40,000)            0.860       0.852    0.848
BNB (K = 35,000)            0.883       0.860    0.853

Table 4. Cross-validation results over MLP, GNB, SVM and BNB in the data set from DIX

Model                       Precision   Recall   F1 score
MLP (K = 30,000, U = 13)    0.897       0.897    0.897
GNB (K = 25,000)            0.849       0.852    0.849
SVM (K = 40,000)            0.850       0.852    0.848
BNB (K = 35,000)            0.444       0.667    0.533
Figure 8(a) depicts the feature space of collection 1, in which the patents were searched by ICT classes and AEC domain keywords. The features in this figure are evenly distributed, incorporating a large number of ICT-related features, but some typical ICTC terms do not appear. Such a feature distribution indicates that the patents in this collection are mainly relevant to ICT, but not ICTC.

Figure 8. TSNE plots of feature spaces: (a) the plot of patents screened by strategy 1; (b) the plot of patents screened by strategy 2; (c) the plot of patents screened by the proposed approach

A possible explanation for this might be that the AEC domain
keywords are not capable of discerning ICTC patents from ICT patents using query-based methods. For example, the keyword "construction project" may match patents related to construction projects, but it can also match patents of a "software project" containing phrases about a "construction project" in the sense of constructing a project. Besides this mismatching problem, strategy 2 can lead to narrow coverage of ICT techniques. As Figure 8(b) shows, the features are agglomerated into clusters, indicating an unbalanced distribution of topics. The features, not surprisingly, are mainly related to the search keywords, such as "wireless" and "mobile". The features in Figure 8(c) are evenly distributed, incorporating a wide range of ICTC-related terminologies, such as "laser", "construction machine" and "radio". This indicates that the proposed approach is more accurate in screening ICTC patents than traditional search engines.
5. Discussion and conclusion
ICT applications are a key determinant to improve the level of coordination and collaboration
in the AEC industry. Even though patents have been recognized as a valuable resource to
provide technological knowledge, the patent offices have not provided a specific classification
of ICTC patents. Acknowledging this research opportunity, the present study accurately and
widely screens a corpus of ICTC patents, by proposing an approach based on deep learning
and NLP techniques.
Specifically, this study has made the following contributions. (1) This study contributes an approach to widely and accurately retrieve and collect a corpus of patents for domains like ICTC that neither exist as a specific classification in patent offices nor can easily be represented by queries. Although patent offices provide elaborate classification schemes, these cannot satisfy all the requirements of the real world. Therefore, when a collection of patents does not exist in the classification schemes, query-based methods become the only possible way to search for these patents. However, query-based methods were developed to retrieve relevant documents for a specific patent application rather than a set of patents. For collections like ICTC that incorporate a variety of technologies, it is an extremely challenging task for query-based methods to retrieve the patents simply by a query. (2) The proposed approach
takes advantage of deep learning and NLP techniques. Although deep learning has become prominent in processing textual data, previous studies in the AEC area mainly utilized machine learning methods to perform classification tasks, whose performance highly depends on feature selection because the traditional machine learning models can only learn linear relations. In contrast, deep learning models are more advanced in learning non-linear relations by using layers of neural networks, thus being more suitable for complex tasks with specific objectives. The validation results indicate that the MLP model outperforms the traditional machine learning models in classifying the ICTC patents. In addition, NLP techniques were employed to process the raw data. In the AEC area, most previous studies only utilized N-grams, tokenization and stop-word removal, ignoring advanced NLP tools such as lemmatization and POS tagging. This study utilizes lemmatization and POS tagging to convert words into their base forms for generating more accurate N-grams from the textual data. (3) In practice, this study contributes a specific collection of ICTC patents, which is not provided by the patent offices. The collection widely and accurately covers the ICT applications in construction, not only constituting a dictionary for accessing ICTC, but also identifying problems to be solved by state-of-the-art ICTC inventions and recognizing all possible specific embodiments of ICTC.
The present study is not without limitations. The feature extraction process is based on traditional BOW models, which do not take the semantic meanings of contexts into consideration. This limitation, however, has been largely offset by the proposed supervised MLP model, which learns complex relations between the inputs and outputs by training deep layers of neurons, and can thus perform the prediction task well without considering the semantic meanings. This study focuses on classifying ICTC and non-ICTC. Future research should concentrate on more complex deep learning approaches that could automatically categorize the ICTC patents according to their technological components or the management issues in practice.
Note
1. In the manual checking process, a patent is labeled as either ICTC or non-ICTC by three PhD students (whose research directions are related to the AEC area) through an in-depth review of the title, abstract, claims and description. A patent is labeled as ICTC if the content expresses that the essence of the technology application is under the ICT scope and the AEC industry is a major embodiment in which the technology application can be implemented. To prevent mistakes as much as possible, two students annotated the patents independently, and the third student made a judgment when the labels of a patent were inconsistent.
References
Adriaanse, A., Voordijk, H. and Dewulf, G. (2010), The use of interorganisational ICT in United States
construction projects,Automation in Construction, Vol. 19 No. 1, pp. 73-83.
Aggarwal, C.C. and Reddy, C.K. (2013), Data Clustering: Algorithms and Applications, CRC press,
Boca Raton.
Agrawal, A. and Henderson, R. (2002), Putting patents in context: exploring knowledge transfer from
MIT,Management Science, Vol. 48 No. 1, pp. 44-60.
Ahuja, V., Yang, J. and Shankar, R. (2009), Study of ICT adoption for building project management in
the Indian construction industry,Automation in Construction, Vol. 18 No. 4, pp. 415-423.
Ahuja, V., Yang, J., Skitmore, M. and Shankar, R. (2010), An empirical test of causal relationships of
factors affecting ICT adoption for building project management: an Indian SME case study,
Construction Innovation, Vol. 10 No. 2, pp. 164-180.
Al Rahhal, M.M., Bazi, Y., Al Zuair, M., Othman, E. and BenJdira, B. (2018), Convolutional neural
networks for electrocardiogram classification,Journal of Medical and Biological Engineering,
Vol. 38 No. 6, pp. 1014-1025.
Alberts, D., Yang, C.B., Fobare-DePonio, D., Koubek, K., Robins, S., Rodgers, M., Simmons, E. and
DeMarco, D. (2017), Introduction to Patent Searching, Current Challenges in Patent Information
Retrieval, Springer, Berlin, Heidelberg, pp. 3-45.
Alsafouri, S. and Ayer, S.K. (2018), Review of ICT implementations for facilitating information flow
between virtual models and construction project sites,Automation in Construction, Vol. 86,
August 2016, pp. 176-189.
Azad, H.K. and Deepak, A. (2019), Query expansion techniques for information retrieval: a survey,
Information Processing and Management, Vol. 56 No. 5, pp. 1698-1735.
Bell, G., Hey, T. and Szalay, A. (2009), Beyond the data deluge,Science, Vol. 323 No. 5919,
pp. 1297-1298.
Bengio, Y., Ducharme, R., Vincent, P. and Jauvin, C. (2003), A neural probabilistic language model,
Journal of Machine Learning Research, Vol. 3 Feb, pp. 1137-1155.
Benson, C.L. and Magee, C.L. (2013), A hybrid keyword and patent class methodology for selecting
relevant sets of patents for a technological field,Scientometrics, Vol. 96 No. 1, pp. 69-82.
Bird, S. and Loper, E. (2004), NLTK: the natural language toolkit,Proceedings of the ACL 2004 on
Interactive Poster and Demonstration Sessions, Association for Computational Linguistics,
Sydney, Australia, pp. 63-70.
Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003), Latent dirichlet allocation,Journal of Machine Learning
Research, Vol. 3, pp. 993-1022.
Bouadjenek, M.R., Sanner, S. and Ferraro, G. (2015), A study of query reformulation for patent prior
art search with partial patent applications,Proceedings of the 15th International Conference on
Artificial Intelligence and Law, Association for Computing Machinery (ACM), San Diego
California, USA, pp. 23-32.
Brown, I. and Mues, C. (2012), An experimental comparison of classification algorithms for imbalanced
credit scoring data sets,Expert Systems with Applications, Vol. 39 No. 3, pp. 3446-3453.
Cakir, L. and Yilmaz, N. (2014), Polynomials, radial basis functions and multilayer perceptron neural
network methods in local geoid determination with GPS/levelling,Measurement, Vol. 57,
pp. 148-153.
Cambria, E. and White, B. (2014), Jumping NLP curves: a review of natural language processing
research,IEEE Computational Intelligence Magazine, Vol. 9 No. 2, pp. 48-57.
Cassetta, E., Marra, A., Pozzi, C. and Antonelli, P. (2017), Emerging technological trajectories and new
mobility solutions. A large-scale investigation on transport-related innovative start-ups and
implications for policy,Transportation Research Part A: Policy and Practice, Vol. 106
March, pp. 1-11.
Chakrabarti, S., Dom, B. and Indyk, P. (1998), Enhanced hypertext categorization using hyperlinks,
Acm Sigmod Record, Vol. 27 No. 2, pp. 307-318.
Chiarello, F., Cimino, A., Fantoni, G. and Dell'Orletta, F. (2018), Automatic users extraction from patents, World Patent Information, Vol. 54, pp. 28-38.
Choi, S., Kang, D., Lim, J. and Kim, K. (2012), A fact-oriented ontological approach to SAO-based
function modeling of patents for implementing Function-based Technology Database,Expert
Systems with Applications, Vol. 39 No. 10, pp. 9129-9140.
Cocarascu, O. and Toni, F. (2018), Combining deep learning and argumentative reasoning for the
analysis of social media textual content using small data sets,Computational Linguistics,
Vol. 44 No. 4, pp. 833-858.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K. and Kuksa, P. (2011), Natural
language processing (almost) from scratch,Journal of Machine Learning Research, Vol. 12
Aug, pp. 2493-2537.
Czerniawski, T., Sankaran, B., Nahangi, M., Haas, C. and Leite, F. (2018), 6D DBSCAN-based
segmentation of building point clouds for planar object classification,Automation in
Construction, Vol. 88, pp. 44-58.
Davies, R. and Harty, C. (2013), Implementing Site BIM: a case study of ICT innovation on a large
hospital project,Automation in Construction, Vol. 30, pp. 15-24.
Dehlin, S. and Olofsson, T. (2008), An evaluation model for ICT investments in construction projects,
Electronic Journal of Information Technology in Construction, Vol. 13, pp. 343-361.
Duchi, J., Hazan, E. and Singer, Y. (2011), Adaptive subgradient methods for online learning and
stochastic optimization,Journal of Machine Learning Research, Vol. 12 Jul, pp. 2121-2159.
El-Ghandour, W. and Al-Hussein, M. (2004), Survey of information technology applications in
construction,Construction Innovation, Vol. 4 No. 2, pp. 83-98.
Enesi, F.A., Oyefolahan, I.O., Abdullahi, M.B. and Salaudeen, M.T. (2018), Enhanced query expansion
algorithm: framework for effective ontology based information retrieval system,i-Managers
Journal on Computer Science, Vol. 6 No. 4, pp. 1-11.
Fall, C., Benzineb, K., Guyot, J., Törcsvári, A. and Fiévet, P. (2003a), Computer-assisted categorization of patent documents in the international patent classification, Proceedings of the International Chemical Information Conference (ICIC'03), Royal Society of Chemistry, Nîmes, France.
Fall, C.J., Törcsvári, A., Benzineb, K. and Karetka, G. (2003b), Automated categorization in the international patent classification, SIGIR Forum, Vol. 37 No. 1, pp. 10-25.
Fan, X., Li, S. and Tian, L. (2015), Chaotic characteristic identification for carbon price and an multi-
layer perceptron network prediction model,Expert Systems with Applications, Vol. 42 No. 8,
pp. 3945-3952.
Flyvbjerg, B. (2014), What you should know about megaprojects and why: an overview,Project
Management Journal, Vol. 45 No. 2, pp. 6-19.
Forman, G. (2002), Choose Your Words Carefully: An Empirical Study of Feature Selection Metrics for Text Classification, Vol. 2431, pp. 150-162.
Friedman, J., Hastie, T. and Tibshirani, R. (2001), The Elements of Statistical Learning, Springer series
in statistics New York, New York, NY.
Frits, S. (2007), Strategy to enhance use of ICT in construction,Proceedings of CIB World Building
Congress, International Council for Building, Cape Town, pp. 2527-2535.
García Adeva, J.J., Pikatza Atxa, J.M., Ubeda Carrillo, M. and Ansuategi Zengotitabengoa, E. (2014), Automatic text classification to support systematic reviews in medicine, Expert Systems with Applications, Vol. 41 No. 4, pp. 1498-1508.
Garcia-Laencina, P.J., Sancho-Gomez, J.L. and Figueiras-Vidal, A.R. (2013), Classifying patterns with
missing values using Multi-Task Learning perceptrons,Expert Systems with Applications,
Vol. 40 No. 4, pp. 1333-1341.
Gerken, J.M. (2012), A new instrument for technology monitoring: novelty in patents measured by
semantic patent analysis,Scientometrics, Vol. 91 No. 3, pp. 645-670.
Giachanou, A., Salampasis, M. and Paltoglou, G. (2015), Multilayer source selection as a tool for
supporting patent search and classification,Information Retrieval Journal, Vol. 18 No. 6,
pp. 559-585.
Gimpel, K., Schneider, N., O'Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M., Yogatama, D.,
Flanigan, J. and Smith, N.A. (2011), Part-of-Speech tagging for twitter: annotation, features,
and experiments,Meeting of the Association for Computational Linguistics: Human Language
Technologies: Short Papers, Vol. 2, pp. 42-47.
Girthana, K. and Swamynathan, S. (2018), Semantic query-based patent summarization system
(SQPSS),International Conference on Intelligent Information Technologies, Vol. 941,
pp. 169-179.
Greiman, V.A. (2013), Megaproject Management: Lessons on Risk and Project Management from the
Big Dig, John Wiley and Sons.
Gwak, J.H. and Sohn, S.Y. (2018), A novel approach to explore patent development paths for subfield
technologies,Journal of the Association for Information Science and Technology, Vol. 69 No. 3,
pp. 410-419.
Habash, N., Rambow, O. and Roth, R. (2009), MADAþTOKAN: a toolkit for Arabic tokenization,
diacritization, morphological disambiguation, POS tagging, stemming and lemmatization,
Proceedings of the 2nd International Conference on Arabic Language Resources and Tools
(MEDAR), The MEDAR Consortium, Cairo, Egypt, Vol. 41, p. 62.
Haddi, E., Liu, X. and Shi, Y. (2013), The role of text pre-processing in sentiment analysis,Procedia
Computer Science, Vol. 17, pp. 26-32.
Haykin, S. (1999), Neural Networks a Comprehensive Introduction, Prentice Hall, New Jersey, NJ.
Hoetker, G. and Agarwal, R. (2007), Death hurts, but it isn't fatal: the postexit diffusion of knowledge created by innovative companies, Academy of Management Journal, Vol. 50 No. 2, pp. 446-467.
Intarakumnerd, P. and Charoenporn, P. (2015), Impact of stronger patent regimes on technology
transfer: the case study of Thai automotive industry,Research Policy, Vol. 44 No. 7, pp. 1314-1326.
Ittoo, A., Nguyen, L.M. and van den Bosch, A. (2016), Text analytics in industry: challenges,
desiderata and trends,Computers in Industry, Vol. 78, pp. 96-107.
Kaplan, S. and Vakili, K. (2015), The double-edged sword of recombination in breakthrough
innovation,Strategic Management Journal, Vol. 36 No. 10, pp. 1435-1457.
Kingma, D.P. and Ba, J. (2014), Adam: a method for stochastic optimization,arXiv preprint arXiv:
1412.6980.
Kurdi, M.Z. (2017), Natural Language Processing and Computational Linguistics 2: Semantics,
Discourse and Applications, John Wiley and Sons.
Lee, C., Song, B. and Park, Y. (2013), How to assess patent infringement risks: a semantic patent claim
analysis using dependency relationships,Technology Analysis and Strategic Management,
Vol. 25 No. 1, pp. 23-38.
Levitt, R.E. (2007), CEM research for the next 50 years: maximizing economic, environmental, and
societal value of the built environment,Journal of Construction Engineering and Management-
Asce, Vol. 133 No. 9, pp. 619-628.
Levy, O., Goldberg, Y. and Dagan, I. (2015), Improving distributional similarity with lessons learned
from word embeddings,Transactions of the Association for Computational Linguistics, Vol. 3,
pp. 211-225.
Li, H., Chan, G., Wong, J.K.W. and Skitmore, M. (2016), Real-time locating systems applications in
construction,Automation in Construction, Vol. 63, pp. 37-47.
Li, S., Hu, J., Cui, Y. and Hu, J. (2018), DeepPatent: patent classification with convolutional neural
networks and word embedding,Scientometrics, Vol. 117 No. 2, pp. 721-744.
Li, X., Shen, G.Q., Wu, P. and Yue, T. (2019), Integrating building information modeling and
prefabrication housing production,Automation in Construction, Vol. 100, pp. 46-60.
Li, Z., Tate, D., Lane, C. and Adams, C. (2012), A framework for automatic TRIZ level of invention
estimation of patents using natural language processing, knowledge-transfer and patent
citation metrics,Computer-Aided Design, Vol. 44 No. 10, pp. 987-1010.
Liu, S.-H., Liao, H.-L., Pi, S.-M. and Hu, J.-W. (2011), Development of a patent retrieval and analysis
platform a hybrid approach,Expert Systems with Applications, Vol. 38 No. 6, pp. 7864-7868.
Maaten, L. and Hinton, G. (2008), Visualizing data using t-SNE,Journal of Machine Learning
Research, Vol. 9 No. 2605, pp. 2579-2605.
Mahdabi, P. and Crestani, F. (2014), The effect of citation analysis on query expansion for patent
retrieval,Information Retrieval, Vol. 17 Nos 5-6, pp. 412-429.
Mahdabi, P., Keikha, M., Gerani, S., Landoni, M. and Crestani, F. (2011), Building queries for prior-art
search,Information Retrieval Facility Conference, Springer, Berlin, Heidelberg.
Michel, J. and Bettels, B. (2001), Patent citation analysis. A closer look at the basic input data from
patent search reports,Scientometrics, Vol. 51 No. 1, pp. 185-201.
Mirończuk, M.M. and Protasiewicz, J. (2018), A recent overview of the state-of-the-art elements of text classification, Expert Systems with Applications, Vol. 106, pp. 36-54.
Mok, K.Y., Shen, G.Q. and Yang, J. (2015), Stakeholder management studies in mega construction
projects: a review and future directions,International Journal of Project Management, Vol. 33
No. 2, pp. 446-457.
Moraes, R., Valiati, J.F. and Neto, W.P.G. (2013), Document-level sentiment classification: an empirical
comparison between SVM and ANN,Expert Systems with Applications, Vol. 40 No. 2,
pp. 621-633.
Munková, D., Munk, M. and Vozár, M. (2013), Data pre-processing evaluation for text mining: transaction/sequence model, Procedia Computer Science, Vol. 18, pp. 1198-1207.
Nadeau, D. and Sekine, S. (2007), A survey of named entity recognition and classification,
Lingvisticae Investigationes, Vol. 30 No. 1, pp. 3-26.
Niemann, H., Moehrle, M.G. and Frischkorn, J. (2017), Use of a new patent text-mining and
visualization method for identifying patenting patterns over time: concept, method and test
application,Technological Forecasting and Social Change, Vol. 115, pp. 210-220.
Onan, A., Korukoğlu, S. and Bulut, H. (2016), Ensemble of keyword extraction methods and classifiers in text classification, Expert Systems with Applications, Vol. 57, pp. 232-247.
Pavlinek, M. and Podgorelec, V. (2017), "Text classification method based on self-training and LDA topic models", Expert Systems with Applications, Vol. 80, pp. 83-93.
Perez-Molina, E. (2018), "The role of patent citations as a footprint of technology", Journal of the Association for Information Science and Technology, Vol. 69 No. 4, pp. 610-618.
Riedmiller, M. (1994), "Advanced supervised learning in multi-layer perceptrons - from backpropagation to adaptive learning algorithms", Computer Standards and Interfaces, Vol. 16 No. 3, pp. 265-278.
Robbins, H. and Monro, S. (1951), "A stochastic approximation method", The Annals of Mathematical Statistics, Vol. 22, pp. 400-407.
Rosenblatt, F. (1961), Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Cornell Aeronautical Laboratory, Buffalo, NY.
Saiki, T., Akano, Y., Watanabe, C. and Tou, Y. (2006), "A new dimension of potential resources in innovation: a wider scope of patent claims can lead to new functionality development", Technovation, Vol. 26 No. 7, pp. 796-806.
Sardroud, J.M. (2015), "Perceptions of automated data collection technology use in the construction industry", Journal of Civil Engineering and Management, Vol. 21 No. 1, pp. 54-66.
Shalaby, W. and Zadrozny, W. (2018), "Toward an interactive patent retrieval framework based on distributed representations", The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery (ACM), New York, pp. 957-960.
Shalaby, W. and Zadrozny, W. (2019), "Patent retrieval: a literature review", Knowledge and Information Systems, Vol. 61, pp. 1-30.
Shannon, C.E. (1948), "A mathematical theory of communication", Bell System Technical Journal, Vol. 27 No. 3, pp. 379-423.
Shekarpour, S., Marx, E., Ngonga Ngomo, A.C. and Auer, S. (2015), "SINA: semantic interpretation of user queries for question answering on interlinked data", Journal of Web Semantics, Vol. 30, pp. 39-51.
Silva, F.N., Amancio, D.R., Bardosova, M., Costa, L.D. and Oliveira, O.N. (2016), "Using network science and text analytics to produce surveys in a scientific topic", Journal of Informetrics, Vol. 10 No. 2, pp. 487-502.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V. and Lanctot, M. (2016), "Mastering the game of Go with deep neural networks and tree search", Nature, Vol. 529 No. 7587, p. 484.
Smith, H. (2002), "Automation of patent classification", World Patent Information, Vol. 24 No. 4, pp. 269-271.
Sokolova, M. and Lapalme, G. (2009), "A systematic analysis of performance measures for classification tasks", Information Processing and Management, Vol. 45 No. 4, pp. 427-437.
Sparck Jones, K. (1972), "A statistical interpretation of term specificity and its application in retrieval", Journal of Documentation, Vol. 28 No. 1, pp. 11-21.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014), "Dropout: a simple way to prevent neural networks from overfitting", The Journal of Machine Learning Research, Vol. 15 No. 1, pp. 1929-1958.
Tannebaum, W. and Rauber, A. (2014), "Using query logs of USPTO patent examiners for automatic query expansion in patent searching", Information Retrieval, Vol. 17 Nos 5-6, pp. 452-470.
Terragno, P.J. (1979), "Patents as technical literature", IEEE Transactions on Professional Communication, Vol. PC-22 No. 2, pp. 101-104.
Tieleman, T. and Hinton, G. (2012), "Lecture 6.5 - rmsprop: divide the gradient by a running average of its recent magnitude", COURSERA: Neural Networks for Machine Learning, Vol. 4 No. 2, pp. 26-31.
USPTO (2007), Manual of Patent Examining Procedure, 8th ed., Thomson/West, Alexandria.
Screening patents of ICT in construction
Venugopalan, S. and Rai, V. (2015), "Topic based classification and pattern identification in patents", Technological Forecasting and Social Change, Vol. 94, pp. 236-250.
Vlachidis, A. and Tudhope, D. (2016), "A knowledge-based approach to information extraction for semantic interoperability in the archaeology domain", Journal of the Association for Information Science and Technology, Vol. 67 No. 5, pp. 1138-1152.
Wang, J. (2018), "Innovation and government intervention: a comparison of Singapore and Hong Kong", Research Policy, Vol. 47 No. 2, pp. 399-412.
Wang, X., Jiang, W. and Luo, Z. (2016), "Combination of convolutional and recurrent neural network for sentiment analysis of short texts", Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Association for Computational Linguistics, Osaka, Japan, pp. 2428-2437.
Wu, C.H., Yun, K. and Huang, T. (2010), "Patent classification system using a new hybrid genetic algorithm support vector machine", Applied Soft Computing Journal, Vol. 10 No. 4, pp. 1164-1177.
Wu, P., Song, Y., Shou, W., Chi, H., Chong, H.Y. and Sutrisna, M. (2017), "A comprehensive analysis of the credits obtained by LEED 2009 certified green buildings", Renewable and Sustainable Energy Reviews, Vol. 68, pp. 370-379.
Zhang, L., Liu, Z., Li, L., Shen, C. and Li, T. (2018), "PatSearch: an integrated framework for patentability retrieval", Knowledge and Information Systems, Vol. 57 No. 1, pp. 135-158.
Zhao, Z., Xu, S., Kang, B.H., Kabir, M.M.J., Liu, Y. and Wasinger, R. (2015), "Investigation and improvement of multi-layer perceptron neural networks for credit scoring", Expert Systems with Applications, Vol. 42 No. 7, pp. 3508-3516.
Zidane, Y.J.T., Johansen, A. and Ekambaram, A. (2013), "Megaprojects - challenges and lessons learned", Procedia - Social and Behavioral Sciences, Vol. 74, pp. 349-357.
Corresponding author
Xue Lin can be contacted at: linxue@nju.edu.cn
For instructions on how to order reprints of this article, please visit our website:
www.emeraldgrouppublishing.com/licensing/reprints.htm
Or contact us for further details: permissions@emeraldinsight.com