Automated Information Transformation for Automated Regulatory Compliance Checking in Construction

Abstract

To fully automate regulatory compliance checking of construction projects, regulatory requirements need to be automatically extracted from various construction regulatory documents and then transformed into a formalized format that enables automated reasoning. To address this need, the authors propose an approach for automatically extracting information from construction regulatory textual documents and transforming them into logic clauses that could be directly used for automated reasoning. This paper focuses on presenting the proposed information transformation (ITr) methodology and the corresponding algorithms. The proposed ITr methodology utilizes a rule-based, semantic natural language processing (NLP) approach. A set of semantic mapping (SeM) rules and conflict resolution (CoR) rules are used to enable the automation of the transformation process. Several syntactic text features (captured using NLP techniques) and semantic text features (captured using an ontology) are used in the SeM and CoR rules. A bottom-up method is leveraged to handle complex sentence components. A consume and generate mechanism is proposed to implement the bottom-up method and execute the SeM rules. The proposed ITr algorithms were tested in transforming information instances of quantitative requirements, which were automatically extracted from the International Building Code 2009, into logic clauses. The algorithms achieved 98.2 and 99.1% precision and recall, respectively, on the testing data.
1 Graduate Student, Dept. of Civil and Environmental Engineering, Univ. of Illinois at Urbana-Champaign, 205 N. Mathews Ave., Urbana, IL 61801.
2 Assistant Professor, Dept. of Civil and Environmental Engineering, Univ. of Illinois at Urbana-Champaign, 205 N. Mathews Ave., Urbana, IL 61801 (corresponding author). E-mail: gohary@illinois.edu; Tel: +1-217-333-6620; Fax: +1-217-265-8039.
Jiansong Zhang1; and Nora M. El-Gohary, A.M.ASCE2
The published version is found in the ASCE Library here: http://ascelibrary.org/doi/abs/10.1061/(ASCE)CP.1943-5487.0000427
Zhang, J. and El-Gohary, N. (2015). "Automated Information Transformation for Automated Regulatory Compliance Checking in
Construction." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000427, B4015001.
CE Database subject headings: Project management; Construction management; Information management; Computer applications; Artificial intelligence.
Author keywords: Automated compliance checking; Automated information extraction; Automated information transformation; Natural language processing; Semantic systems; Automated construction management systems.
Introduction
Construction projects must comply with a host of regulations. The manual process of compliance checking is, thus, time-consuming, costly, and error-prone (Han et al. 1998; Nguyen 2005; Zhang and El-Gohary 2013c). Automated compliance checking (ACC), as an alternative to manual checking, is expected to reduce the time, cost, and errors of compliance checking (CC) (Tan et al. 2010; Salama and El-Gohary 2013b). In addition, ACC has many other potential benefits, such as: (1) allowing earlier identification of potential non-compliance instances, which could save significant time and cost caused by design modification and/or rework (Ding et al. 2006); (2) promoting the adoption of building information modeling (BIM) and increasing the cumulative benefits of adopting BIM, since BIM would enable ACC (Pocas Martins and Abrantes 2010); (3) enabling more efficient incorporation of stakeholder input into project design and exploration of what-if design scenarios, since a designer would be better able to experiment with different design options and check their compliance in a more time-efficient manner (Niemeijer et al. 2009); and (4) reducing violations of regulations due to easier and more frequent CC (Zhong et al. 2012).
Due to the many anticipated benefits of ACC, many efforts were undertaken in the area of ACC in construction. The start of these efforts could be dated back to the 1960s, when Fenves et al. (1969) formalized the American Institute of Steel Construction (AISC) specifications into decision tables. These efforts took various approaches to ACC and focused on various ACC purposes (or subdomains). For example, Garrett and Fenves (1987) proposed a strategy to represent design standards using information networks and represent design component properties using data items for ACC of structural designs; Ding et al. (2006) proposed an approach to represent building codes using object-based rules and represent designs using an Industry Foundation Classes (IFC)-based internal model for ACC of accessibility regulations; Tan et al. (2010) proposed an approach to represent building codes and design regulations using decision tables and incorporate simulation results in building information models for ACC of building envelope design; the CORENET (Construction and Real Estate NETwork) project of Singapore (Khemlani 2005) used an approach to represent design information using semantic objects in the FORNAX library (i.e., a C++ library) and represent regulatory rules using properties and functions in FORNAX objects for ACC of building control regulations, barrier free access, and fire code, etc.; and the SMARTcodes project (ICC 2012) of the International Code Council (ICC) used an approach to represent ICC codes in computer-processable tuple format and represent designs using an IFC-based model for ACC of designs with ICC codes. These efforts have all been very important in supporting ACC, and have shown the possibilities of ACC through different system designs and implementations. However, despite their importance, these efforts are limited in their automation capability; existing ACC efforts/systems still require manual effort for the extraction of regulatory requirements from regulatory documents and encoding them in a computer-processable format (Zhong et al. 2012; Zhang and El-Gohary 2013c). To achieve full automation of ACC, this extraction and encoding process needs to be fully automated.
To address this gap, the authors are proposing a new approach for automated rule extraction and formalization for supporting ACC (Zhang and El-Gohary 2013a; Zhang and El-Gohary 2013b). The approach utilizes semantic modeling and semantic Natural Language Processing (NLP) techniques (for both information extraction and information transformation) to facilitate automated textual regulatory document analysis (e.g., code analysis) and processing for extracting requirements/rules from these documents and formalizing these requirements/rules in a meaning-rich, computer-processable format. The approach involves developing a set of algorithms and combining them into one computational platform: (1) machine-learning-based algorithms for text classification (TC), (2) rule-based, semantic NLP algorithms for information extraction (IE), and (3) rule-based, semantic NLP algorithms for information transformation (ITr). This paper focuses on presenting the methodology and algorithms for ITr.
Proposed Approach for Automated Rule Extraction and Formalization for Automated Compliance Checking
Proposed Approach
A five-phase, iterative approach for automatically extracting regulatory requirements/rules from textual regulatory documents and formalizing these requirements in a logic format for further automated reasoning is proposed (Figure 1). The five phases are: text classification (TC), information extraction (IE), information transformation (ITr), implementation, and evaluation. TC, IE, and ITr are the main processing phases.
Insert Figure 1
TC recognizes relevant sentences in a regulatory text corpus. Relevant sentences are the sentences that contain the types of requirements that are relevant for an ACC scenario (e.g., environmental requirements in the scenario of environmental CC). Target information in those relevant sentences is extracted and transformed in the later IE and ITr processes. The TC process, thus, filters out irrelevant sentences, thereby saving unnecessary processing of irrelevant sentences. Such filtering also avoids unnecessary extraction and transformation errors that may be caused by the processing of irrelevant sentences. The presentation of the TC algorithms and results is outside the scope of this paper. For further details on the authors’ work in TC, the reader is referred to Salama and El-Gohary (2013a).
IE recognizes the words and phrases in the relevant sentences that carry target information, extracts information from these words/phrases, and labels them with pre-defined information tags. An information tag is a symbol/name indicating a certain type of meaning. For example, the information tag ‘subject’ carries the semantic meaning that the information instance is a “thing” (e.g., building object) that is subject to a particular regulation or norm; while the information tag ‘JJ’ carries the syntactic meaning that the information instance is an adjective that describes a noun as a modifier. Target information is the information needed to check a specific type of regulatory requirement. For example, for quantitative requirements, the quantified values/measurements of specific properties/attributes are target information. For IE by itself, a seven-phase, iterative methodology is utilized. In the IE methodology, a set of pattern-matching-based IE rules are used. Both syntactic (i.e., related to syntax and grammar, such as part-of-speech (POS) tags) and semantic (i.e., related to context and meaning, such as ontology concepts and relations) text features are used in the IE rules. The presentation of the IE algorithms and results is outside the scope of this paper. For further details on the authors’ work in the area of IE, the reader is referred to Zhang and El-Gohary (2013c).
ITr takes the extracted information instances and transforms them into logic clauses (i.e., logic statements that can be further used in logic programs) using a set of pattern-matching-based rules. Two types of rules are utilized for ITr: semantic mapping (SeM) rules and conflict resolution (CoR) rules. Several syntactic and semantic text features are used in the rules. A bottom-up method is utilized to handle complex sentence components. A “consume and generate” mechanism is proposed to implement the bottom-up method and execute the SeM rules. The following sections present and discuss the proposed ITr methodology in more detail. The experimental implementation of the methodology in processing quantitative requirements from Chapter 19 of the International Building Code (IBC) 2009 is also presented.
Comparison to the State-of-the-Art
In recent years, a number of research efforts, in domains such as software engineering (Breaux and Anton 2008; Kiyavitskaya et al. 2008) and legal compliance (Wyner and Peters 2011), have been studying the extraction of regulatory rules from textual documents. Most of these efforts (1) require manual annotation or mark-up of textual documents; and (2) aim at processing text at a coarser granularity level, i.e., process text into text segments rather than term-level concepts/relations. On the other hand, the proposed approach (1) does not require manual annotation or mark-up of textual documents; and (2) aims at processing text into concepts and relations at the term level (i.e., aims at performing a deeper level of NLP). To the best of the authors’ knowledge, the only work that has taken a somewhat similar approach to the proposed one, since it also does not require manual annotation/mark-up and aims at term-level processing, in addition to utilizing a semantic and logic-based approach, is that by Wyner and Governatori (2013). Wyner and Governatori (2013) have conceptually explored and analyzed the use of semantic parsing and defeasible logic for regulatory rule representation. In comparison, the proposed approach (1) utilizes both syntactic and semantic text features in an integrated way rather than utilizing only semantic information: the use of syntactic text features in addition to semantic ones allows for handling more complex expressions, (2) uses a domain ontology for capturing domain-specific semantic information rather than using generic semantic information produced through generic semantic parsing: capturing and using semantic text features based on domain-specific meaning allows for unambiguous interpretation of concepts/relations/terms (e.g., “bridge” as an infrastructure instead of the card game) and identification of implicit semantic relations (e.g., “fly ash is a type of cementitious material”), (3) uses first order logic (FOL) rather than defeasible logic: FOL is the most widely used in automated reasoning and has been extensively verified for expressivity and simplicity, and (4) has advanced to the stages of implementation, testing, and evaluation: this allows for assessing the validity of the proposed approach using measures of precision and recall.
Background
Natural Language Processing (NLP)
NLP is a subfield of artificial intelligence (AI) that aims at making natural language text or speech computer-understandable, so that the text or speech could be processed by computers in a human-like manner (Cherpas 1992). Examples of NLP-enabled applications include automated natural language translation and automated text summarization (Marquez 2000). Examples of NLP subtasks include tokenization, POS tagging, semantic role labeling (Gildea and Jurafsky 2002), and named entity recognition (Roth and Yih 2004). NLP tasks may take two main approaches: a machine learning (ML)-based approach or a rule-based approach. A ML-based approach utilizes ML algorithms for text processing (e.g., Pradhan et al. 2004), whereas a rule-based approach utilizes manually-coded rules (e.g., Soysal et al. 2010). Rule-based methods require more human effort for rule development, but tend to show better text processing performance (Crowston et al. 2010). From another viewpoint, NLP approaches could be either shallow or deep. Shallow NLP conducts partial analysis of a sentence or extracts partial, specific information from a sentence (e.g., entities or concepts). Deep NLP aims at full sentence analysis towards capturing the entire meaning of a sentence (Zouaq 2011). The state-of-the-art in NLP has achieved reasonable performances for shallow NLP tasks, whereas it is still being challenged by deep NLP tasks. Deep NLP requires elaborate knowledge representation and reasoning, which remains a challenge for AI (Tierney 2012).
In the construction domain, there have been a number of important research efforts that have utilized NLP techniques. For example, Caldas and Soibelman (2003) have conducted ML-based text classification of construction documents. For an overview of some of these efforts, the reader is referred to Zhang and El-Gohary (2013c).
Rule-Based NLP using Pattern-Matching-Based Rules
Pattern-matching-based rules are widely used in NLP tasks such as POS tagging (Abney 1997; Yin and Fan 2013), information extraction (Califf and Mooney 2003), and text understanding (Goh et al. 2006). The idea of pattern-matching-based rules is to define a set of results that are produced when a pattern, i.e., a specific sequence (or a structure such as a tree) of elements (e.g., characters, tokens, symbols, terms, concepts), is matched. Pattern-matching-based rules have a variety of implementations tailored to different purposes and domains. However, they all share the same rule schema of “if pattern then result” or the mapping of “from pattern to result”. For example, in the proposed SeM rules, the result is the transformation of information instances into logic clause elements; while in the proposed CoR rules, the result is the deletion or conversion of certain information instances and/or their information tags to resolve conflicts.
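The “if pattern then result” schema can be illustrated with a minimal Python sketch. This is not the authors’ implementation: the function `apply_rule`, the helper `merge_passive`, the tag name `passive_relation`, and the sample tagged sentence are all hypothetical and serve only to show a rule whose pattern is a sequence of tags and whose result rewrites the matched items.

```python
from typing import Callable, List, Tuple

# A tagged item is (text, tag). Tag names are illustrative assumptions.
Tagged = Tuple[str, str]


def apply_rule(items: List[Tagged], pattern: List[str],
               result: Callable[[List[Tagged]], List[Tagged]]) -> List[Tagged]:
    """Scan left to right; wherever the tag sequence equals `pattern`,
    replace the matched items with `result(matched)`."""
    if not pattern:
        raise ValueError("pattern must contain at least one tag")
    out: List[Tagged] = []
    i = 0
    while i < len(items):
        window = items[i:i + len(pattern)]
        if [tag for _, tag in window] == pattern:
            out.extend(result(window))   # "then result"
            i += len(pattern)
        else:
            out.append(items[i])         # no match: keep the item unchanged
            i += 1
    return out


def merge_passive(matched: List[Tagged]) -> List[Tagged]:
    """Hypothetical result: merge the matched words into one relation item."""
    text = "_".join(word for word, _ in matched)
    return [(text, "passive_relation")]


tagged = [("walls", "subject"), ("provided", "VBN"),
          ("by", "IN"), ("contractor", "NN")]
merged = apply_rule(tagged, ["VBN", "IN"], merge_passive)
```

Here the pattern is the tag sequence `VBN` followed by `IN`, and the result is a single merged item; a deletion-style result would simply return fewer items than it receives.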
Semantic Modeling and Semantic NLP
A semantic model aims at capturing the meanings of a domain or topic, usually in a structured manner. Ontology is a widely-used type of semantic model; it is defined as “an explicit specification of a conceptualization” (Gruber 1995). An ontology is, commonly, composed of concept hierarchies, relationships between concepts, and axioms. The axioms are used together with the concepts and relationships to define the semantic meaning of the conceptualization. An ontology is easily reusable and extendable (El-Gohary and El-Diraby 2010). The use of a semantic model could help in NLP tasks. For example, semantic-based IE has been shown to achieve better performance than syntactic-only IE (Soysal et al. 2010; Zhang and El-Gohary 2013c).
Logic-Based Information Representation and Reasoning
There are several types of formally-defined logic with varying degrees of descriptive capabilities (propositional logic, predicate logic, modal logic, description logic, etc.). Among the different types, FOL is the most widely-used for logic-based inference-making. A Horn Clause (HC) is one of the most restricted forms of FOL. Inference-making in FOL is most efficient using HC logic clauses, because of such restricted form (Saint-Dizier 1994). A HC is composed of a disjunction of literals of which at most one is positive. All HCs can be represented as rules that have one or more antecedents (i.e., left-hand sides (LHSs)) that are conjoined (i.e., combined using the “and” operator), and a single positive consequent (i.e., right-hand side (RHS)). For example, “compliant(T) :- thickness(T) , exterior_basement_wall(W) , has(W,T) , greater_than_or_equal(T, quantity(7 1/2, inches))” is a HC; where “,” is the conjunctive operator (i.e., “A , B” means “A and B”) and “:-” is the implication operator (i.e., “B :- A” means “A implies B”). There are three types of HCs: (1) one or more antecedents and one consequent, (2) zero antecedents and one consequent, and (3) one or more antecedents and zero consequents. Inference-making using HCs could be automatically and efficiently conducted, which makes it suitable for supporting automated reasoning for ACC.
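To make the HC structure concrete, the following Python sketch performs simple forward chaining over ground (variable-free) Horn clauses. It is a simplification made for illustration only: there is no unification, each atom is a plain string, and the names `HornClause`, `forward_chain`, `w1`, and `t1` are hypothetical. The threshold is assumed to be 7 1/2 inches and is written as 7.5.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Set


class HornClause(NamedTuple):
    body: FrozenSet[str]     # antecedents, implicitly joined by "and"
    head: Optional[str]      # the single positive consequent (None for a goal)


def forward_chain(clauses: List[HornClause]) -> Set[str]:
    """Repeatedly fire every clause whose antecedents are all known."""
    known: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if (clause.head is not None
                    and clause.head not in known
                    and clause.body <= known):
                known.add(clause.head)
                changed = True
    return known


# Type (2) clauses: zero antecedents and one consequent (facts about a design).
facts = [
    HornClause(frozenset(), "exterior_basement_wall(w1)"),
    HornClause(frozenset(), "thickness(t1)"),
    HornClause(frozenset(), "has(w1,t1)"),
    HornClause(frozenset(), "greater_than_or_equal(t1,quantity(7.5,inches))"),
]

# Type (1) clause: antecedents and one consequent (the requirement),
# grounded for wall w1 and thickness t1.
rule = HornClause(frozenset(f.head for f in facts), "compliant(t1)")

derived = forward_chain(facts + [rule])
derived_without_check = forward_chain(facts[:3] + [rule])
```

With all four facts present, `compliant(t1)` is derived; dropping the comparison fact leaves the rule unfired. A logic programming engine would additionally bind the variables T and W through unification, which this sketch deliberately omits.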
Proposed Information Transformation Methodology
The proposed ITr takes a rule-based, semantic NLP approach. It utilizes pattern-matching-based rules to automatically generate logic clauses based on the extracted information instances and their associated patterns of information tags. Both syntactic information tags (i.e., tags tagging syntactic text features, e.g., ‘adjective’ is represented using the POS tag ‘JJ’) and semantic information tags (i.e., tags tagging semantic text features, e.g., ‘compliance checking attribute’ is represented using the semantic tag ‘a’) are used in defining the patterns. A number of NLP techniques (e.g., POS tagging, term matching) are used to identify the syntactic information tags of each extracted information instance, and a semantic model (an ontology that represents domain knowledge) is used to identify the semantic information tags. The tagged information instances are transformed into HC-type logic clauses using a set of SeM rules and CoR rules. SeM rules define how to process the extracted information instances, based on their associated types of information tags and the context of the information tags, so that the extracted information instances could be transformed correctly into logic clauses. CoR rules resolve potential conflicts that may exist in the processing of different information tags. A bottom-up method is utilized to handle complex sentence components. A “consume and generate” mechanism is proposed to implement the bottom-up method and execute the SeM rules.
The following subsections present the proposed ITr methodology (Figure 2) in more detail.
Insert Figure 2
The Source: Extracted Information Instances
The information source for the ITr process is the set of input information instances that were obtained from the preceding IE process. Information instances have been labeled with information tags during IE. The implemented changes/improvements on the authors’ IE work since Zhang and El-Gohary (2013c) are: (1) in addition to semantic information tags, syntactic information tags and combinatorial information tags are also generated for further use in ITr; and (2) instead of the top-down method for handling complex sentence components (processing larger chunks of texts first, then breaking them down to process smaller chunks of texts), a bottom-up method (processing smaller chunks of texts first, then aggregating them to process larger chunks of texts) is adopted because in the experiments it was shown to achieve better performance in handling complex sentence components (Zhang and El-Gohary 2013b). As such, in the ITr process, the following three types of information tags (information tags will be shown using single quotes hereafter) are defined and used: (1) semantic information tags, (2) syntactic information tags, and (3) combinatorial information tags.
Semantic information tags are information tags that are related to the meaning and context of the labeled information instances. Instances of semantic information tags are recognized based on the concepts and relations in the domain ontology. For example, in the developed ontology, both “transverse reinforcement” and “vertical reinforcement” are subconcepts of the concept ‘subject’. Therefore, the appearances of “transverse reinforcement” (or “transverse reinforcements”) and “vertical reinforcement” (or “vertical reinforcements”) in Chapter 19 of IBC 2009 will be extracted as instances of the semantic information tag ‘subject’. The decision on which concepts and relations are essential to extract and transform is based on the type of requirement (e.g., quantitative requirements) that is being checked. For example, ‘subject’ is one example of a semantic information tag that is essential in the context of compliance checking of quantitative requirements.
Syntactic information tags are information tags that are related to the grammatical role of the labeled information instances. Instances of syntactic information tags are recognized based on their syntactic features. Syntactic information tags carry information that is more general than that carried by semantic information tags. For example, the syntactic information tag ‘noun’ describes the labeled information instance as a noun, while semantically the noun could belong to a ‘subject’, a ‘compliance checking attribute’, or another semantic information tag. In the proposed methodology, POS tags are mainly used as the syntactic features for syntactic information tags. For example, ‘JJ’ is the POS tag for adjective. It is a syntactic information tag for an information instance that describes properties/attributes of a noun. For example, the adjective “habitable” in “habitable room” is describing the functional property of “room”.
Combinatorial information tags are compound information tags that are composed of multiple semantic and/or syntactic information tags. For example, the combination of ‘past participle verb’ (POS tag ‘VBN’) and ‘preposition’ (POS tag ‘IN’) is a combinatorial information tag (combining two syntactic information tags) that describes a directional passive verbal relation represented by bigrams like “provided by” and “located in”. The combination of ‘adjective’ (syntactic information tag - POS tag ‘JJ’) and ‘subject’ (semantic information tag ‘s’) is another example of a combinatorial information tag (combining syntactic and semantic information tags) that describes a ‘subject’ with a certain property.
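The three tag types can be pictured with a small data-structure sketch in Python. Everything here is an illustrative assumption rather than the authors’ representation: the class `Instance`, the function `combine`, the two tag sets, and the treatment of “room” as a ‘subject’ instance are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Illustrative tag inventories (assumed, not taken from the paper's ontology).
SYNTACTIC_TAGS = {"JJ", "VBN", "IN"}                       # POS tags
SEMANTIC_TAGS = {"subject", "compliance checking attribute"}


@dataclass(frozen=True)
class Instance:
    text: str
    tags: Tuple[str, ...]    # one tag: simple; several tags: combinatorial

    @property
    def kind(self) -> str:
        if len(self.tags) > 1:
            return "combinatorial"
        (tag,) = self.tags
        if tag in SYNTACTIC_TAGS:
            return "syntactic"
        if tag in SEMANTIC_TAGS:
            return "semantic"
        raise ValueError(f"unknown tag: {tag}")


def combine(left: Instance, right: Instance) -> Instance:
    """Join two adjacent instances into one with a combinatorial tag."""
    return Instance(f"{left.text} {right.text}", left.tags + right.tags)


located_in = combine(Instance("located", ("VBN",)), Instance("in", ("IN",)))
habitable_room = combine(Instance("habitable", ("JJ",)),
                         Instance("room", ("subject",)))
```

The first combination joins two syntactic tags, as in the “located in” bigram; the second joins a syntactic tag with a semantic one, as in a ‘subject’ with a certain property.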
The Target: Logic Clauses
The target of the ITr process is the set of output logic clauses which are used to represent the requirements in construction regulations. A HC format is used for such representation, in order to facilitate further automated reasoning using logic programs. One single HC represents one requirement. The RHS of the HC (in Prolog syntax the logical RHS appears to the left of “:-”) indicates the compliance result(s). The LHS of the HC encodes the conditions for the requirement using one or more predicates. Each predicate defines either a concept information instance (e.g., court(C)) or a relation information instance (e.g., has(C,W)). The logic clause elements in a concept predicate are called concept logic clause elements. The logic clause elements in a relation predicate are called relation logic clause elements. Table 1 shows the source and target for a sample sentence.
Insert Table 1
280
Semantic Mapping (SeM) Rules
281
The semantic mapping (SeM) rules define how to process the extracted information instances
282
according to their semantic meaning. The semantic meaning of each information instance is
283
defined by: (1) the information tag it is associated with. For example, in Table 1, ‘subject’
284
defines the semantic meaning of “court”, i.e., it defines that “court” is the ‘subject’ of
285
compliance checking; and (2) the context of the extracted information instance, reflected by the
286
information tags of its surrounding information instances. For example, in the following sentence,
287
the semantic meaning of “not less than” (instance of ‘comparative relation’) is defined by the
288
information tag of its surrounding information instance “for each”: “The minimum net area of
289
ventilation openings shall not be less than 1 square foot for each 150 square feet of crawl space
290
area”. “For each”, here, indicates that “not less than” (relation) is not simply a relationship
291
The published version is found in the ASCE Library here: http://ascelibrary.org/doi/abs/10.1061/(ASCE)CP.1943-5487.0000427
Zhang, J. and El-Gohary, N. (2015). "Automated Information Transformation for Automated Regulatory Compliance Checking in
Construction." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000427, B4015001.
between “net area” (instance of ‘compliance checking attribute’) and “1 square foot” (instance of ‘quantity value’ + ‘quantity unit’), but is also restricted by “150 square feet of crawl space area” (instance of ‘quantity value’ + ‘quantity reference’). The interpretation of this requirement is that the required “minimum net area of ventilation openings” increases by 1 square foot for each additional “150 square feet of crawl space area”.
The semantic meanings of information instances are utilized in patterns on the LHS of SeM rules. For the example in Table 1, the corresponding SeM rule pattern is ‘subject’ + ‘modal verb’ + ‘negation’ + ‘be’ + ‘comparative relation’ + ‘quantity value’ + ‘quantity unit’ + ‘preposition’ + ‘compliance checking attribute’. An SeM rule with this LHS pattern will transform the information instances into the logic clause shown in the last row of Table 1. A sample action defined on the RHS of this SeM rule is: “Generate predicates for the ‘subject’ information instance, the ‘attribute’ information instance, and a ‘has’ information instance. The two arguments of the ‘has’ information instance are from the ‘subject’ predicate and the ‘attribute’ predicate, respectively”. Accordingly, the following logic clause elements are generated for the statement below, since “court” is recognized as a ‘subject’ information instance and “width” as an ‘attribute’ information instance.
Sentence: “Courts shall not be less than 3 feet in width”
Logic Clause Elements: court(Court), width(Width), has(Court,Width)
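The sample action above can be illustrated with a minimal Python sketch. The function names `to_variable` and `subject_has_attribute` are hypothetical and are not taken from the authors' implementation; the sketch only shows how the three predicates could be generated once the ‘subject’ and ‘attribute’ instances are known.

```python
def to_variable(term):
    """Turn a term such as 'court' into a Prolog-style variable name ('Court')."""
    return term[:1].upper() + term[1:]


def subject_has_attribute(subject, attribute):
    """Hypothetical SeM action: emit one concept predicate for the 'subject',
    one for the 'attribute', and a 'has' relation predicate linking the two."""
    s_var, a_var = to_variable(subject), to_variable(attribute)
    return [
        f"{subject}({s_var})",
        f"{attribute}({a_var})",
        f"has({s_var},{a_var})",
    ]


elements = subject_has_attribute("court", "width")
print(", ".join(elements))  # court(Court), width(Width), has(Court,Width)
```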
The ITr method is intended to process the terms of a sentence sequentially. In general, sequential processing for information transformation requires the information instances matched by patterns (in SeM rules) to be located strictly next to each other. Such a rigid requirement could make it difficult to process sentences with different structures. To avoid that, the proposed SeM rules do not follow such a rigid requirement. Instead, the SeM rules allow for “look-back searching” (i.e., searching to the left of the matched words) and “look-ahead searching” (i.e., searching to the right of the matched words) to find instances that match certain information tags in a pattern. For example, in the following pattern, the instance of the first ‘subject’ does not have to be located right next to the instance of ‘preposition’: “‘subject’ + ‘preposition’ + ‘subject’”. It is only required to be the ‘subject’ instance that is closest to the ‘preposition’ instance from the left. “Look-back searching”, here, searches to the left of the word matched for ‘preposition’ to find the closest ‘subject’ instance when the latter part of the pattern, “‘preposition’ + ‘subject’”, is matched. This allows for more flexibility in the use of SeM rules to handle sentence complexities (e.g., those incurred by cases such as tail recursive nested clauses). For example, an SeM rule uses the following pattern P1 to match the last three information instances in InS1 (‘s’ for ‘subject’, ‘VBP’ for ‘non-3rd person singular present verb’, ‘dpvr’ for ‘directional passive verbal relation’, and ‘VB’ for ‘base form verb’), finds the first information instance in InS1 through “look-back searching”, and generates the logic clause elements LC1 for the part of sentence S1:
Pattern P1: ‘non-3rd person singular present verb’ + ‘directional passive verbal relation’ + ‘base form verb’
Information Instances InS1: (‘connection’, ‘s’) … (‘are’, ‘VBP’), (‘designed_to’, ‘dpvr’), (‘yield’, ‘VB’)
Sentence S1: “Connections that are designed to yield shall be capable of …”
Logic Clause Elements LC1: connection(Connection), yield(Yield), designed_to(Connection,Yield)
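A minimal sketch of “look-back searching” for pattern P1 follows, assuming that information instances are stored as (text, tag) pairs in sentence order. The helper names `look_back` and `apply_p1` are hypothetical, and the placeholder tuple only stands in for the instances elided in InS1; this is an illustration, not the authors' code.

```python
def look_back(instances, start, tag):
    """Return the index of the closest instance left of `start` with `tag`, or None."""
    for i in range(start - 1, -1, -1):
        if instances[i][1] == tag:
            return i
    return None


def apply_p1(instances):
    """Match 'VBP' + 'dpvr' + 'VB', then look back for the closest 'subject'."""
    pattern = ["VBP", "dpvr", "VB"]
    tags = [tag for _, tag in instances]
    for i in range(len(tags) - len(pattern) + 1):
        if tags[i:i + len(pattern)] != pattern:
            continue
        j = look_back(instances, i, "s")
        if j is None:
            continue  # no subject to the left, so this match cannot be used
        subject = instances[j][0]
        relation, verb = instances[i + 1][0], instances[i + 2][0]
        s_var, v_var = subject.capitalize(), verb.capitalize()
        return [f"{subject}({s_var})", f"{verb}({v_var})",
                f"{relation}({s_var},{v_var})"]
    return []


# ("…", "ELIDED") is a placeholder for the instances elided in InS1.
ins1 = [("connection", "s"), ("…", "ELIDED"), ("are", "VBP"),
        ("designed_to", "dpvr"), ("yield", "VB")]
lc1 = apply_p1(ins1)
print(", ".join(lc1))
```

The subject does not need to be adjacent to the matched segment; the search simply walks left until it finds the nearest instance with the required tag.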
In the proposed methodology, application-specific SeM rules are developed based on a randomly selected sample of text (called “development text”, which is also used for text analysis and further development of CoR rules). For developing a set of SeM rules for ITr, a three-step, iterative methodology, applied to each sentence, is proposed: (1) find all relations in a sentence (e.g., “of” and “not exceed” in the sentence “Spacing of transverse reinforcement shall not exceed 8 inches.”); (2) for each relation, run the existing SeM rule set to check whether it generates the corresponding logic clause elements correctly, and define the subsequent action based on the following three cases: (a) if the corresponding logic clause elements are correctly generated, then move on to check the next relation, (b) if the corresponding logic clause elements are incorrectly generated, then create a new SeM rule with a more specific pattern (i.e., a longer pattern with more features) than the applied SeM rule and add it to the rule set with a higher priority, and (c) if the corresponding logic clause elements are not generated, then create a new SeM rule and add it to the rule set; and (3) after all relations have been checked, run the updated SeM rule set on all checked sentences and check whether the added SeM rules have introduced errors. If errors have been introduced, then identify the source(s) of errors (i.e., the rule(s) that introduced the errors) and adjust those rules as necessary.
Conflict Resolution (CoR) Rules
The conflict resolution (CoR) rules resolve conflicts between information tags. Two types of CoR rules are used: deletion CoR rules and conversion CoR rules. Deletion CoR rules resolve conflicts between information tags by deleting certain information instances. For example, the following deletion CoR rule CoR1 is used to delete the redundant information instances InS2 (‘cr’ for ‘candidate restriction’) from the set of extracted information instances InS3 (‘s’ for ‘subject’) for the sentence S2:
Deletion CoR Rule CoR1: “if an information instance has the tag ‘subject’ and it subsumes its following information instance(s), then delete its following information instance(s).”
Information Instances InS2: (‘exterior’, ‘cr’), (‘basement’, ‘cr’), (‘wall’, ‘cr’)
Information Instances InS3: (‘exterior basement wall’, ‘s’), (‘exterior’, ‘cr’), (‘basement’, ‘cr’), (‘wall’, ‘cr’)
Sentence S2: “The thickness of exterior basement walls and foundation walls shall be not less than 7 1/2 inches.”
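The deletion rule CoR1 can be sketched in Python as follows. The sketch assumes (text, tag) pairs and approximates “subsumes” by checking whether the following instance's text is one of the words of the ‘subject’ instance; that approximation and the function name `apply_cor1` are assumptions made for illustration only.

```python
def apply_cor1(instances):
    """Drop instances that directly follow a 'subject' instance and are
    subsumed by it (approximated here as: their text is one of its words)."""
    result = []
    i = 0
    while i < len(instances):
        text, tag = instances[i]
        result.append((text, tag))
        i += 1
        if tag == "s":
            words = text.split()
            # Skip (delete) the following instances covered by the subject.
            while i < len(instances) and instances[i][0] in words:
                i += 1
    return result


ins3 = [("exterior basement wall", "s"), ("exterior", "cr"),
        ("basement", "cr"), ("wall", "cr")]
print(apply_cor1(ins3))  # [('exterior basement wall', 's')]
```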
Conversion CoR rules resolve conflicts between information tags by converting the information tags of information instances into other types of information tags. For example, the following conversion CoR rule CoR2 is used to convert the information tags in information instances InS4 (‘s’ for ‘subject’, ‘I’ for ‘inter clause boundary relation’, and ‘a’ for ‘compliance checking attribute’) to the information tags in information instances InS5 (‘IN’ for ‘preposition’) for the sentence S3:
Conversion CoR Rule CoR2: “if ‘with’ is directly followed by an information instance that has the information tag ‘compliance checking attribute’ and ‘with’ has the information tag ‘inter clause boundary relation’, then convert the information tag of ‘with’ to ‘preposition’.”
Information Instances InS4: (‘wall segment’, ‘s’), (‘with’, ‘I’), (‘horizontal_length_to_thickness_ratio’, ‘a’)
Information Instances InS5: (‘wall segment’, ‘s’), (‘with’, ‘IN’), (‘horizontal_length_to_thickness_ratio’, ‘a’)
Sentence S3: “Wall segments with a horizontal length-to-thickness ratio less than 2.5 shall be designed as columns.”
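The conversion rule CoR2 can be sketched in the same style, again assuming (text, tag) pairs. The function name `apply_cor2` is hypothetical; the tag abbreviations are the ones listed above.

```python
def apply_cor2(instances):
    """Convert the tag of 'with' from 'I' (inter clause boundary relation) to
    'IN' (preposition) when it is directly followed by an instance tagged 'a'
    (compliance checking attribute)."""
    converted = list(instances)  # leave the input list untouched
    for i in range(len(converted) - 1):
        text, tag = converted[i]
        if text == "with" and tag == "I" and converted[i + 1][1] == "a":
            converted[i] = (text, "IN")
    return converted


ins4 = [("wall segment", "s"), ("with", "I"),
        ("horizontal_length_to_thickness_ratio", "a")]
ins5 = apply_cor2(ins4)
print(ins5)
```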
In the proposed rule-based ITr, the CoR rules are executed after the information instances have been extracted by the IE process and before the SeM rules are executed. CoR rules need to be developed when conflicts between SeM rules cannot be resolved by adjusting SeM rule patterns and actions. For developing a set of CoR rules for ITr, a five-step methodology is proposed: (1) find the information tags that are the sources of errors through pattern analysis of the conflicting SeM rules; (2) for each conflict, create a new candidate CoR rule to resolve the conflict; (3) try the candidate rule and empirically analyze whether the conflict was resolved without introducing new conflicts; (4) if the trial was successful, then add the candidate CoR rule as a new rule to the existing CoR rule set, and if the trial was unsuccessful, then iterate Steps 3 and 4 until a successful trial is found; and (5) after each new CoR rule is added, check all SeM rules and update them as necessary according to the changes in information tags caused by the new CoR rule.
Bottom-up Method for Handling Complex Sentence Components
Due to the variability of natural language expressions and structures, sentences used in regulatory provisions can be very complex. For example, phrases and clauses can be continuously attached to or nested in a sentence to enrich it with more relevant information. Complex sentences are difficult to process for information extraction and transformation. Complex sentence components are intermediately-processed segments of text that are: (1) expressed using a variety of natural language structure patterns, and (2) composed of multiple concepts and relations. Complex sentence components are more likely to result in complex sentence structures by embedding in or attaching more concepts and relations to a sentence. Figure 3 shows a complex sentence from IBC 2006. Two methods were explored for handling complex sentence components: a top-down method and a bottom-up method (Figure 4). The top-down method starts from the top level (i.e., the full sentence) and proceeds down to identify and process complex sentence components. The bottom-up method starts from the lowest level (i.e., single terms/concepts/relations in a sentence) and proceeds up to identify and process complex sentence components. The bottom-up method is employed in the proposed ITr approach because, in the authors’ previous work, it was shown to achieve better performance than the top-down method (Zhang and El-Gohary 2013b).
Insert Figure 3
Insert Figure 4
In the bottom-up method, the SeM rules are used to process sentences starting from the lowest level, i.e., starting from information instances (which correspond to single terms/concepts/relations in a sentence). The information instances in the source text are put into lists (one list for each sentence) and are processed one by one until all information instances have been processed. The order of the instances in each list follows their order in the original sentence.
To apply the bottom-up method, the authors propose a new “consume and generate” mechanism to execute the SeM rules in a sequential manner. This mechanism follows the heuristics of the “sliding window” method in computational research (i.e., a sequence of data is sequentially processed, segment by segment, and each segment has a predefined fixed length, the “window size”) and the mechanism of transcription in the genetics domain (i.e., a sequence of DNA is sequentially transcribed, segment by segment, and each segment has a length of about 17 base pairs). The “consume and generate” mechanism processes all text segments that match an SeM rule pattern, where each segment matches the pattern of one SeM rule and each pattern consists of information tags for a sequence of information instances. However, in comparison to the “sliding window” method, the segment length in the proposed “consume and generate” mechanism is not fixed across patterns, to allow for flexibility in capturing complex sentence structures. The length of each segment is determined by the number of information tags in the corresponding SeM rule pattern. For example, the following pattern P2 has a segment length of three and matches the information instances InS6 for the part of sentence S4 to generate logic clause elements LC2:
Pattern P2: ‘compliance checking attribute’ + ‘of’ + ‘subject’
Information Instances InS6: (‘area’, ‘a’), (‘of’, ‘OF’), (‘space’, ‘s’)
Sentence S4: “The net free ventilating area shall not be less than 1/150 of the area of the space ventilated …”
Logic Clause Elements LC2: space(Space), area(Area), has(Space,Area)
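The variable segment length can be illustrated with a simplified Python sketch of a “consume and generate” loop. It assumes (text, tag) pairs and a priority-ordered list of (pattern, action) rules; the one-tag rule `lone_subject`, the helper names, and the skip-on-no-match behavior are assumptions for illustration and do not reproduce the authors' pseudocode in Figure 9.

```python
def var(term):
    """Prolog-style variable name for a term ('space' -> 'Space')."""
    return term[:1].upper() + term[1:]


def attribute_of_subject(texts):
    """Action for pattern P2: 'a' + 'OF' + 's'."""
    attribute, _, subject = texts
    return [f"{subject}({var(subject)})", f"{attribute}({var(attribute)})",
            f"has({var(subject)},{var(attribute)})"]


def lone_subject(texts):
    """Hypothetical one-tag rule, included only to show a different segment length."""
    (subject,) = texts
    return [f"{subject}({var(subject)})"]


# Rules are tried in priority order; each pattern fixes its own segment length.
RULES = [(["a", "OF", "s"], attribute_of_subject), (["s"], lone_subject)]


def consume_and_generate(instances, rules):
    elements, pos = [], 0
    while pos < len(instances):
        for pattern, action in rules:
            segment = instances[pos:pos + len(pattern)]
            if [tag for _, tag in segment] == pattern:
                elements.extend(action([text for text, _ in segment]))  # generate
                pos += len(pattern)                                     # consume
                break
        else:
            pos += 1  # no rule matched at this position; move on
    return elements


ins6 = [("area", "a"), ("of", "OF"), ("space", "s")]
lc2 = consume_and_generate(ins6, RULES)
print(", ".join(lc2))  # space(Space), area(Area), has(Space,Area)
```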
The “consume and generate” mechanism allows for backward matching: if the information instances extracted from a segment of text match the latter part of a pattern, then the information instance(s) extracted from the preceding text are checked for matching the earlier part of the same pattern, and the corresponding logic clauses are generated if the check succeeds. For example, the following information tags InT1 are associated with the five information instances from the part of sentence S5. After the first three information instances InS7 are processed based on matching with the pattern P3, two information instances, “or” and “space”, remain. These two remaining information instances only match the latter part (i.e., the second and third information tags) of the pattern P4 for ‘conjunctive subject’. Normally, this partial matching would not initiate the processing of the information instances. However, under the proposed backward matching mechanism, the preceding information instance “interior room” is checked for matching the earlier part of the pattern for ‘conjunctive subject’ (i.e., the first information tag: ‘subject’). Since “interior room” matches ‘subject’, the SeM rule for ‘conjunctive subject’ is applied and the two remaining information instances are processed to generate the logic clause elements LC3 (where “;” is the disjunctive operator, i.e., “A ; B” means “A or B”).
Information Tags InT1: ‘compliance checking attribute’, ‘of’, ‘subject’, ‘conjunctive term’, ‘subject’
Sentence S5: “…the floor area of the interior room or space…”
Information Instances InS7: “floor area”, “of”, “interior room”
Pattern P3: ‘compliance checking attribute’ + ‘of’ + ‘subject’
Pattern P4: ‘subject’ + ‘conjunctive term’ + ‘subject’
Logic Clause Elements LC3: interior_room(Interior_room); space(Interior_room)
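Backward matching for pattern P4 can be sketched as follows. The sketch assumes (text, tag) pairs with full tag names and a current position `pos` just after the instances already consumed by P3; `backward_match`, `conjunctive_subject`, and `name` are hypothetical helpers, not the authors' functions.

```python
def name(text):
    """Predicate name for an instance text ('interior room' -> 'interior_room')."""
    return text.replace(" ", "_")


def backward_match(instances, pos, pattern):
    """If the instances at `pos` match only the latter part of `pattern`, check
    whether the preceding instances match the earlier part. Return the full
    matched segment, or None."""
    tags = [tag for _, tag in instances]
    for k in range(1, len(pattern)):  # k = number of tags taken from preceding text
        head, tail = pattern[:k], pattern[k:]
        if (pos - k >= 0 and tags[pos - k:pos] == head
                and tags[pos:pos + len(tail)] == tail):
            return instances[pos - k:pos + len(tail)]
    return None


def conjunctive_subject(segment):
    """Action for P4: both subjects share one variable, joined by ';' (or)."""
    first, second = name(segment[0][0]), name(segment[2][0])
    shared = first.capitalize()
    return f"{first}({shared}); {second}({shared})"


s5 = [("floor area", "compliance checking attribute"), ("of", "of"),
      ("interior room", "subject"), ("or", "conjunctive term"),
      ("space", "subject")]
p4 = ["subject", "conjunctive term", "subject"]

segment = backward_match(s5, 3, p4)  # position 3: P3 already consumed 0..2
lc3 = conjunctive_subject(segment)
print(lc3)  # interior_room(Interior_room); space(Interior_room)
```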
Validation
Results are evaluated in terms of precision, recall, and F1 measure. Precision is the number of correctly generated logic clause elements divided by the total number of generated logic clause elements. Recall is the number of correctly generated logic clause elements divided by the total number of logic clause elements that should be generated. F1 measure is the harmonic mean of precision and recall, which assigns equal weights to the two. Ideally, both 100% recall and 100% precision are desired. However, given the inherent trade-off between the two measures, such a result is difficult to achieve. The ultimate goal for ACC is, therefore, to achieve 100% recall of non-compliance instances with high precision.
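The three measures can be computed as in the short Python sketch below. The counts used in the example are hypothetical and are not the results reported in this paper; the function name `evaluate` is likewise an assumption.

```python
def evaluate(correct, generated, expected):
    """Return (precision, recall, F1) for logic clause element counts.

    correct   -- number of correctly generated logic clause elements
    generated -- total number of generated logic clause elements
    expected  -- number of logic clause elements that should be generated
    """
    precision = correct / generated if generated else 0.0
    recall = correct / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts, for illustration only.
p, r, f1 = evaluate(correct=90, generated=100, expected=120)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```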
Experimental Implementation and Validation
For testing and validation, the proposed ITr methodology was empirically implemented in transforming information instances of quantitative requirements, which were automatically extracted from IBC 2009, into logic clauses.
Source Text Selection
The proposed ACC approach and ITr methodology are intended to process information from a variety of construction-related textual regulatory documents (e.g., building codes, environmental regulations, safety regulations and standards). Since building codes are the primary sets of regulations governing the design, construction, operation, and maintenance of residential and commercial buildings, they were chosen for testing the proposed ITr methodology. In the U.S., almost all state authorities (except for Delaware, Massachusetts, Mississippi, and Missouri) adopt versions of the IBC by ICC. Thus, IBC was selected as the source text corpus. More specifically, IBC 2006 and IBC 2009 were selected because of their availability and for ease of comparison with the authors’ previous NLP work, in which IBC 2006 and IBC 2009 were used for testing and validation (Zhang and El-Gohary 2013c).
The SeM and CoR rules were developed based on Chapters 12 and 23 of IBC 2006, and the proposed ITr algorithms were tested in processing information instances of “quantitative requirements” that were extracted from Chapter 19 of IBC 2009. A quantitative requirement is a requirement that defines the relationship between an attribute of a certain building element/part and a specific quantity value (or quantity range). For example, the following sentence states that the width (attribute) of a court (building element/part) should be greater than or equal to 3 feet (quantity value): “Courts shall not be less than 3 feet in width”. The authors decided to conduct the experiment on the extraction of quantitative requirements because: (1) IBC 2006 and IBC 2009 describe many quantitative requirements (e.g., on average, quantitative requirements represent 41% of the requirements in Chapters 12 and 23 of IBC 2006 and Chapter 19 of IBC 2009), which ensures a sufficient number of relevant sentences for development and testing; and (2) sentences describing quantitative requirements appear to be more complex than those describing other types of requirements (e.g., existential requirements, which require the existence of a certain building element/part), which implies that they are more difficult to process. This makes quantitative requirements good candidates for testing.
Tool Selection
The proposed TC, IE, and ITr algorithms were combined into one computational platform. Prolog was selected for logic clause representation, in order to facilitate future CR. Prolog is an approximate realization of the logic programming computational model on a sequential machine (Sterling and Shapiro 1986). It is the most popular logic programming language with a reasoner. The syntax of B-Prolog was used. B-Prolog is a Prolog system with extensions for programming concurrency, constraints, and interactive graphics. It has a bidirectional interface with C and Java (Zhou 2012). To facilitate quantitative reasoning, a set of built-in rules was developed to perform arithmetic and comparative operations on the proposed quantitative representation. The TC and IE algorithms were implemented using the General Architecture for Text Engineering (GATE) tools (Univ. of Sheffield 2013). GATE has built-in tools for a variety of text processing functions (e.g., tokenization, sentence splitting, POS tagging, gazetteer compiling, and morphological analysis). For ITr, the SeM rules and CoR rules were implemented using the Python programming language (v3.3.2). The “re” module (i.e., the regular expression module) in Python was used for pattern matching, so that each extracted information instance could be used in subsequent processing steps based on its information tag (example tags are shown in Figure 3). A domain ontology was developed and used to facilitate semantic IE and ITr. In developing the ontology, the ontology development methodology in El-Gohary and El-Diraby (2010) was followed. GATE’s built-in ontology editor was used for ontology building and editing.
Information Representation
Two types of logic statements in B-Prolog syntax were utilized: facts and rules. A rule has the form: “H :- B1, B2, …, Bn. (n>0)”. H, B1, …, Bn are atomic formulas. H is called the head, and the RHS of ‘:-’ is called the body of the rule. A fact is a special kind of rule whose body is always true (Zhou 2012). Each requirement rule in IBC 2006 and IBC 2009 is represented as one single B-Prolog rule. Instances of concepts are represented using unary predicates. For example, the information instance “floor” is represented by the predicate “floor(F)”, with “floor” being the predicate name and the variable “F” (all variables in B-Prolog start with a capital letter) being the argument of the predicate. Instances of relations are represented using binary or n-ary predicates. For example, “provided with” is a relation that is represented as the predicate “provided_with(A,B)”, while the variables “A” and “B” could be defined in the predicates interior_space(A) and space_heating_system(B). Each design fact, on the other hand, is represented using one B-Prolog fact. The B-Prolog reasoner can then automatically reason about the facts and rules and, accordingly, determine the compliance checking result(s). An example is shown in Figure 5.
Insert Figure 5
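The rule form “H :- B1, B2, …, Bn.” can be illustrated by a small Python helper that assembles a clause string from generated predicates. The helper `make_rule`, the head predicate `compliant(A)`, and the constant `room1` are hypothetical names chosen for illustration; they are not taken from Figure 5 or from the authors' implementation.

```python
def make_rule(head, body):
    """Assemble a Prolog clause string: a fact if `body` is empty, else a rule."""
    if not body:
        return f"{head}."
    return f"{head} :- {', '.join(body)}."


# Concept predicates (unary) and a relation predicate (binary) form the body.
rule = make_rule("compliant(A)",
                 ["interior_space(A)", "space_heating_system(B)",
                  "provided_with(A,B)"])
fact = make_rule("interior_space(room1)", [])
print(rule)
print(fact)
```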
Information Tags
A total of 40 information tags were developed for use in the SeM rules and CoR rules for ITr. Of these, 17 are semantic information tags, 22 are syntactic information tags, and 1 is a combinatorial information tag.
Two main types of semantic information tags were defined (as per Figure 6): essential information tags and secondary information tags. Essential information tags are tags for information that must be defined for this specific type of requirement. Six main types of essential information tags were defined for quantitative requirements: subject, compliance checking attribute, comparative relation, quantity value, quantity unit, and quantity reference. A ‘subject’ is an ontology concept; it is a “thing” (e.g., building object, space) that is subject to a particular regulation or norm. A ‘compliance checking attribute’ is an ontology concept; it is a specific characteristic of a ‘subject’ by which its compliance is assessed. A ‘comparative relation’ is an ontology relation that is commonly used for comparing quantitative values (i.e., comparing an existing value to a required minimum or maximum value). Five subtypes of comparative relations were further defined: ‘greater than or equal to’, ‘greater than’, ‘less than or equal to’, ‘less than’, and ‘equal to’. A ‘quantity value’ is a value, or a range of values, that defines the quantified requirement. A ‘quantity unit’ is the unit of measure for the ‘quantity value’. A ‘quantity reference’ is a reference to another quantity (which includes a value and a unit).
Secondary information tags are tags for information that is not necessary for this specific type of requirement, but may exist in defining the requirement. Two main types of secondary information tags were defined for quantitative requirements: ‘restriction’ and ‘exception’. A ‘restriction’ is a concept that places a constraint on the ‘subject’, ‘compliance checking attribute’, ‘comparative relation’, pair of ‘quantity value’ and ‘quantity unit’, pair of ‘quantity value’ and ‘quantity reference’, or the full requirement. A ‘subject restriction’ is a concept that places a constraint on the ‘subject’. Two subtypes of ‘subject restriction’ were further defined: ‘possessive subject restriction’ and ‘nonpossessive subject restriction’. A ‘possessive subject restriction’ places a possessive constraint on the ‘subject’, thereby restricting the ‘subject’ to possess certain building parts or properties. For example, in the following requirement sentence, “having windows opening on opposite sides” is a ‘possessive subject restriction’ on “court”: “Courts having windows opening on opposite sides shall not be less than 6 feet in width”. A ‘nonpossessive subject restriction’ places a nonpossessive constraint on the ‘subject’, thereby restricting the ‘subject’ not to possess certain building parts or properties. A ‘compliance checking attribute restriction’ places a constraint on the ‘compliance checking attribute’, thereby restricting the ‘compliance checking attribute’ to a more specific type. For example, in the following requirement sentence, “to the outdoors” is a ‘compliance checking attribute restriction’ on “minimum openable area”: “The minimum openable area to the outdoors shall be 4 percent of the floor area being ventilated”. A ‘comparative relation restriction’ places a constraint on the ‘comparative relation’, thereby restricting the ‘comparative relation’ using new conditions. For example, in the following requirement sentence, “for each 150 square feet of crawl space area” is a ‘comparative relation restriction’ on “not less than”: “The minimum net area of ventilation openings shall not be less than 1 square foot for each 150 square feet of crawl space area”. A ‘quantity restriction’ places a constraint on the ‘quantity value’ + ‘quantity unit’/‘quantity reference’ pair, thereby specifying the properties (e.g., range) of the pair. A ‘full requirement restriction’ places a constraint on the whole quantitative requirement, thereby restricting the quantitative requirement with new preconditions. An ‘exception’ defines a condition where the described requirement does not apply.
For syntactic information tags, the Hepple POS Tagger was used to generate POS tag features. Some additional syntactic features that were not in the Hepple POS Tagger (e.g., the preposition “of”) were also defined. Each selected POS type and defined syntactic feature represents a syntactic information tag such as adjective (POS tag ‘JJ’) and preposition “of” (the literal “OF”). One combinatorial information tag was defined for use in this implementation and was called ‘directional passive verbal relation’, which is the combination of ‘past participle verb’ (POS tag ‘VBN’) and ‘preposition’ (POS tag ‘IN’). Combinatorial information tags are expressive and flexible. Thus, more combinatorial information tags may be defined and used if more complex information tags are needed to capture complex meanings or patterns.
Insert Figure 6
Gold Standard
The gold standard for Chapter 19 of IBC 2009 was developed semi-automatically. In the authors’ previous work, all sentences that include a number (whether written in digits or in words) were automatically extracted to ensure 100% recall of sentences describing quantitative requirements. Then, one of the authors manually deleted false positive sentences. After that, one of the authors manually coded the logic clauses based on the extracted information instances from each sentence. The gold standard was reviewed by two other researchers to verify its correctness. Because of the unambiguous nature of quantitative requirements, along with the well-defined information representation that is used in the proposed methodology, there was agreement in formulating the gold standard. For Chapter 19, 62 sentences containing quantitative requirements were recognized. Correspondingly, 62 logic clauses were coded. In these 62 logic clauses, 1901 logic clause elements were identified, including 568 logic clause elements for describing concepts and 1333 logic clause elements for describing relations between concepts.
Algorithm Implementation
The proposed ITr methodology was implemented in the Python programming language. The processing steps of an example sentence, the pseudocode for the main algorithm, and the pseudocode for the “consume and generate” mechanism are shown in Figure 7, Figure 8, and Figure 9, respectively.
Insert Figure 7
Insert Figure 8
Insert Figure 9
As shown in Figure 7, the IE process tags the original sentence with information tags (from Part I to Part II). The main ITr algorithm then represents each information instance in the tagged sentence as a four-tuple (from Part II to Part III). The CoR rules in the main algorithm then process the list of information instance tuples to resolve conflicts between tuples (from Part III to Part IV). The “consume and generate” code then executes the set of SeM rules to process each tuple in the list and generate logic clause elements based on the matching of SeM rule patterns (from Part IV to Part V). For each information instance, the four-tuple stores: (1) the information instance itself, (2) the location of the information instance in the corresponding sentence (represented by the starting point of the information instance in the sentence), (3) the length of the information instance in terms of number of letters, and (4) the information tag of the information instance (e.g., ‘Interior’, 0, 15, and ‘s’ for the first information instance in Part III of Figure 7).
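The four-tuple representation described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the class name InfoInstance, the helper build_tuples, and the single-letter tags other than ‘s’ are assumptions, and the example sentence is the one used in Table 1 rather than the sentence shown in Figure 7.

```python
from typing import List, NamedTuple, Tuple


class InfoInstance(NamedTuple):
    """Four-tuple for one information instance (field names are assumed)."""
    text: str    # (1) the information instance itself
    start: int   # (2) starting point of the instance in the sentence
    length: int  # (3) length of the instance in characters
    tag: str     # (4) information tag of the instance


def build_tuples(sentence: str, spans: List[Tuple[str, str]]) -> List[InfoInstance]:
    """Locate each (text, tag) span in the sentence, scanning left to right."""
    instances: List[InfoInstance] = []
    cursor = 0
    for text, tag in spans:
        start = sentence.index(text, cursor)  # raises ValueError if not found
        instances.append(InfoInstance(text, start, len(text), tag))
        cursor = start + len(text)
    return instances


sentence = "Courts shall not be less than 3 feet in width."
# Assumed tags: s = subject, n = negation, c = comparative relation,
# v = quantity value, u = quantity unit, a = compliance checking attribute.
spans = [("Courts", "s"), ("not", "n"), ("less than", "c"),
         ("3", "v"), ("feet", "u"), ("width", "a")]
instances = build_tuples(sentence, spans)
```

Storing the starting point and length alongside the tag keeps each instance traceable to its position in the source sentence, which is what later conflict resolution between overlapping instances would need.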
In the main algorithm (Figure 8), the CoR rules are executed through the function “resolve conflicts”. Then, the SeM rules are executed using the “consume and generate” code, which processes the conflict-free information instances for each sentence of the source text file (in the format of a list of four-tuples) to generate and display the corresponding logic clause. As shown in Figure 9, the “consume and generate” code checks the patterns of each SeM rule (PATTERN1, PATTERN2, PATTERN3…) and generates logic clauses as a result of matching the SeM rules. If no pattern matches, the default negative step length enables backward matching.
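One possible reading of the “consume and generate” loop is sketched below in Python. It is an assumption-laden illustration rather than the pseudocode of Figure 9: the scan is assumed to start at the end of the tuple list, a matched window is consumed (removed) and turned into clause text, and a default step of -1 moves the window one tuple toward the start of the sentence when nothing matches. The first rule follows the ‘OF’ pattern listed in Table 2; the second rule and all function names are hypothetical.

```python
from typing import Callable, List, Tuple

Instance = Tuple[str, str]  # (text, information tag)
Rule = Tuple[List[str], Callable[[List[str]], str]]


def var(word: str) -> str:
    """Turn a concept name into a logic variable, e.g. 'court' -> 'Court'."""
    return word.capitalize()


def has_rule(words: List[str]) -> str:
    # attribute 'OF' subject  ->  a(A),c(C),has(C,A)
    a, _, c = words
    return f"{a}({var(a)}),{c}({var(c)}),has({var(c)},{var(a)})"


def concept_rule(words: List[str]) -> str:
    # Hypothetical fallback rule: a lone subject becomes a single predicate.
    (s,) = words
    return f"{s}({var(s)})"


# Longer patterns are listed first so that they are tried first.
RULES: List[Rule] = [(["a", "OF", "s"], has_rule), (["s"], concept_rule)]


def consume_and_generate(instances: List[Instance], rules: List[Rule],
                         step: int = -1) -> Tuple[List[str], List[Instance]]:
    """Return the generated clause texts and the tuples left unconsumed."""
    remaining = list(instances)
    clauses: List[str] = []
    pos = len(remaining) - 1  # index of the last tuple in the candidate window
    while pos >= 0:
        for pattern, generate in rules:
            start = pos - len(pattern) + 1
            window = remaining[start:pos + 1] if start >= 0 else []
            if [tag for _, tag in window] == pattern:
                clauses.append(generate([text for text, _ in window]))
                del remaining[start:pos + 1]  # consume the matched tuples
                pos = start - 1
                break
        else:
            pos += step  # no rule matched: step backward by default
    return clauses, remaining


clauses, rest = consume_and_generate(
    [("width", "a"), ("of", "OF"), ("court", "s")], RULES)
```

In this sketch the three tuples for "width of court" are consumed by the first rule and yield one clause; tuples that match no pattern are simply left in the remaining list.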
Experimental Results and Discussion
The proposed ITr algorithms were tested in transforming information instances of quantitative requirements, which were automatically extracted from Chapter 19 of IBC 2009, into logic clauses. Two experiments were conducted to compare the performance of two methods of information representation: (1) using essential semantic information tags only, and (2) using essential, as well as secondary, semantic information tags.
In Experiment #1, only the essential semantic information tags were used: ‘subject’, ‘compliance checking attribute’, ‘comparative relation’, ‘quantity value’, ‘quantity unit’, and ‘quantity reference’. A subset of the gold standard (including the logic clause elements corresponding to the essential semantic information instances) was used as the gold standard for Experiment #1. A total of 53 SeM rules and 11 CoR rules were developed.
In Experiment #2, both essential and secondary information tags were used. Figure 3 shows examples of some of the information tags that were used. A total of 297 SeM rules and 9 CoR rules were encoded. The gold standard of Experiment #2 (the full gold standard set) contains 177% more logic clause elements than the gold standard of Experiment #1. This shows that, for quantitative requirements, the source text contains many secondary information instances.
The SeM rules that were developed in the experiments are classified into four main types: simple SeM rules, multiple action SeM rules, multiple condition SeM rules, and complex SeM rules. A simple SeM rule is the simplest type, where a strict SeM pattern directly maps to a logic clause. For multiple action SeM rules, other actions (called “supportive actions”), such as “look-ahead searching” and “look-back searching”, are involved in addition to mapping SeM patterns to logic clauses. For multiple condition SeM rules, the mapping from SeM patterns to logic clauses is encoded in subrules to handle subtly different cases in rule conditions, such as the existence/non-existence status of certain information instances. A complex SeM rule combines the features of the multiple action and multiple condition types; it utilizes both supportive actions and subrules to support mappings from SeM patterns to logic clauses.
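To make the rule types more concrete, the following Python sketch imitates one complex SeM rule, loosely following the “c (a) v (b)” pattern listed in Table 2: a comparative relation followed by a quantity value, with look-back and look-ahead searches as supportive actions and two subrules for the negation case. The function names, the tag ‘r’ for quantity reference, and the example instances are assumptions for illustration and are not taken from the authors' rule set.

```python
from typing import List, Optional, Set, Tuple

Instance = Tuple[str, str]  # (text, information tag)


def look_back(instances: List[Instance], index: int, tags: Set[str]) -> Optional[str]:
    """Supportive action: nearest instance before `index` with one of `tags`."""
    for text, tag in reversed(instances[:index]):
        if tag in tags:
            return text
    return None


def look_ahead(instances: List[Instance], index: int, tags: Set[str]) -> Optional[str]:
    """Supportive action: nearest instance after `index` with one of `tags`."""
    for text, tag in instances[index + 1:]:
        if tag in tags:
            return text
    return None


def apply_comparative_rule(instances: List[Instance]) -> Optional[str]:
    """Match 'c' followed by 'v', then build a(S, quantity(b, u))."""
    for i in range(len(instances) - 1):
        if instances[i][1] == "c" and instances[i + 1][1] == "v":
            relation, value = instances[i][0], instances[i + 1][0]
            subject = look_back(instances, i, {"a", "s"})
            unit = look_ahead(instances, i + 1, {"u", "r"})
            if subject is None or unit is None:
                return None
            clause = f"{relation}({subject.capitalize()},quantity({value},{unit}))"
            # Subrules: the negation case is decided by a look-back search.
            negated = look_back(instances, i, {"n"}) is not None
            return f"not {clause}" if negated else clause
    return None


# Hypothetical tagged instances, not taken from IBC 2009.
negated_case = [("thickness", "a"), ("not", "n"), ("less_than", "c"),
                ("3", "v"), ("inches", "u")]
plain_case = [("thickness", "a"), ("less_than", "c"), ("3", "v"), ("inches", "u")]
negated_clause = apply_comparative_rule(negated_case)
plain_clause = apply_comparative_rule(plain_case)
```

A simple SeM rule would correspond to the pattern match alone; the two searches make the rule a multiple action rule, and the negation branch adds the multiple condition aspect.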
The logic clauses generated from the SeM rules are classified into three main types: single predicate logic clauses, multiple predicate logic clauses, and compound predicate logic clauses. A single predicate logic clause includes only one single predicate (e.g., “space(Space)”). A multiple predicate logic clause includes more than one predicate (e.g., “space(Space), area(Area), has(Space, Area)”). A compound predicate logic clause has predicate(s) that embed other predicate(s) as argument(s) (e.g., “greater_than_or_equal(T, quantity(71/2, inches))”).
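The three clause types can be distinguished mechanically once predicates are represented as nested structures. The Python sketch below is one way to do so; the Predicate class and the helper functions are hypothetical names introduced for illustration, and the compound example reuses the quantity from Table 1.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Predicate:
    """A predicate whose arguments are variables/constants or other predicates."""
    name: str
    args: Tuple[Union[str, "Predicate"], ...]

    def render(self) -> str:
        parts = [a.render() if isinstance(a, Predicate) else a for a in self.args]
        return f"{self.name}({', '.join(parts)})"


def render_clause(predicates: List[Predicate]) -> str:
    """Join the predicates of one logic clause with commas."""
    return ", ".join(p.render() for p in predicates)


def classify(predicates: List[Predicate]) -> str:
    """Return 'single', 'multiple', or 'compound' for a list of predicates."""
    embeds = any(isinstance(a, Predicate) for p in predicates for a in p.args)
    if embeds:
        return "compound"
    return "single" if len(predicates) == 1 else "multiple"


single = [Predicate("space", ("Space",))]
multiple = [Predicate("space", ("Space",)),
            Predicate("area", ("Area",)),
            Predicate("has", ("Space", "Area"))]
compound = [Predicate("greater_than_or_equal",
                      ("Width", Predicate("quantity", ("3", "feet"))))]
```

Here a clause is treated as compound whenever any predicate embeds another predicate as an argument, regardless of how many predicates the clause contains; that precedence is a design choice of this sketch.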
Table 2 shows the patterns of the most applied SeM rules (i.e., rules applied at least three times) in the experiments. The patterns of the rest of the applied SeM rules are shown in Table 3.
Insert Table 2
Insert Table 3
The overall performance results of Experiment #1 and Experiment #2 are summarized in Table 4 and Table 5, respectively.
Insert Table 4
Insert Table 5
A comparison between the results of Experiment #1 and those of Experiment #2 is summarized in Table 6. The number of information tags used in Experiment #2 was 400% higher than that used in Experiment #1. The increase in the number of SeM rules was of similar magnitude (460%). Through analysis, the causes of this increase in the number of SeM rules were found to be: (1) the use of more information tags increases the length of patterns in SeM rules, which in turn increases the specificity of each pattern; and (2) the use of more information tags increases the complexity of patterns in SeM rules, which in turn increases the possible number of patterns. In contrast to SeM rules, the number of CoR rules decreased from Experiment #1 to Experiment #2. This results from the use of more information tags, which makes information instances more distinguishable and in turn leads to fewer conflicts between information instances.
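The reported change in the number of rules can be checked with a short calculation, using the rule counts stated for the two experiments (53 and 297 SeM rules; 11 and 9 CoR rules). The function name percent_change is introduced here only for illustration.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Rule counts as reported for Experiment #1 and Experiment #2.
sem_change = percent_change(53, 297)  # SeM rules: about +460%
cor_change = percent_change(11, 9)    # CoR rules: about -18%
```

The SeM figure rounds to the 460% stated above, while the CoR rule count drops by roughly 18%.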
The algorithms achieved overall precision, recall, and F1 measure of 92.5%, 95.1%, and 93.8%, respectively, in Experiment #1, and of 98.2%, 99.1%, and 98.6%, respectively, in Experiment #2. Both precision and recall improved in Experiment #2 because the use of more information tags could: (1) better distinguish and capture the variations in expressions; and (2) help define SeM rules with more specificity in patterns. Based on the comparative analysis, the following conclusion can be drawn: the use of more information tags helps improve the performance of information transformation.
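The reported F1 values are consistent with the standard definition of F1 as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short Python check below assumes that definition; the function name f1_measure is illustrative.

```python
def f1_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units)."""
    return 2 * precision * recall / (precision + recall)


# Overall precision and recall (in percent) as reported for each experiment.
f1_exp1 = round(f1_measure(92.5, 95.1), 1)
f1_exp2 = round(f1_measure(98.2, 99.1), 1)
```

Rounded to one decimal place, the two results agree with the 93.8% and 98.6% reported above.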
Insert Table 6
The precision of relation logic clause elements is lower than the other precision and recall values in both Experiment #1 and Experiment #2. Through analysis, four main causes of this relatively lower precision (89.8% and 97.5% for Experiment #1 and Experiment #2, respectively) of relation logic clause elements were recognized: (1) Structural ambiguity caused by conjunctive terms: For example, in the following sentence fragment, there are two possible syntactic uses of “and”, either linking “wall piers” and “such segments” or linking the preceding clause and the following clause: “…shear wall segments provide lateral support to the wall piers and such segments have a total stiffness…”. The ability of the SeM rules to handle structural ambiguity is limited by the development text, which may lead to errors; (2) Incorrect tagging during IE: For example, “professional” (in “registered design professional”) was incorrectly tagged as an adjective instead of a noun. This is due to the imperfection of state-of-the-art POS tagging methods; (3) Errors due to morphological analysis (MA): MA was used for improving the recall of semantic information instances by finding all forms of a term based on its lexical form. However, while useful in this regard, MA also introduced false positive instances. For example, as a result of MA, “supported” was stemmed into “support”, matched with the concept “support” in the ontology, and as a result incorrectly recognized as an instance of ‘subject’; and (4) Errors caused by certain SeM rules: For example, one SeM rule selects the immediate left neighbor of a preposition as the first argument of that preposition. In cases where the immediate left neighbor of a preposition is not its real first argument, this SeM rule causes errors. For example, in the following sentence fragment, “gypsum concrete” was mistakenly identified as the first argument of “between”, rather than “clear span”: “clear span of the gypsum concrete between supports”.
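The fourth cause can be reproduced with a few lines of Python. The sketch below is a simplified stand-in for the rule described above, not the authors' rule: the function name first_argument, the chunking of the fragment, and the tags are assumptions, and the determiner “the” is omitted for brevity.

```python
from typing import List, Tuple

Chunk = Tuple[str, str]  # (text, information tag)


def first_argument(chunks: List[Chunk], prep_index: int) -> str:
    """Heuristic: take the immediate left neighbor of the preposition."""
    return chunks[prep_index - 1][0]


# "clear span of the gypsum concrete between supports", chunked by assumption.
chunks = [("clear span", "a"), ("of", "OF"), ("gypsum concrete", "s"),
          ("between", "IN"), ("supports", "s")]

# "between" is at index 3; the heuristic picks its left neighbor.
selected = first_argument(chunks, 3)
```

The heuristic returns “gypsum concrete” for “between”, although the intended first argument is “clear span”; for “of” (index 1) the same heuristic happens to return the correct neighbor.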
In analyzing the other errors (other than those influencing the precision of relation logic clause elements), two additional causes of errors were recognized: (1) Missing tags in IE: For example, based on the concepts in the ontology, “connection” should have been semantically tagged as ‘subject’. However, in a few instances, it was missing the ‘subject’ information tag. This is due to the inherent errors in the NLP tools that were used (no existing NLP tool can achieve 100% performance); and (2) Errors in processing sentences with uncommon syntactic expression structures: For example, in the sentence fragment “which have been water soaked for at least 24 hours…”, “soaked” (‘compliance checking attribute’) was not recognized because: (a) “soaked” was not semantically recognized because the ontology did not cover this concept, and (b) the syntactic feature of “soaked” (i.e., past participle) was not a common syntactic expression for ‘compliance checking attribute’ (in contrast, a noun is a common expression for ‘compliance checking attribute’).
Limitations and Future Work
The experimental results show that the proposed approach is promising for automatically transforming the extracted information instances into logic clauses for further compliance reasoning. In spite of the high performance that was achieved (98.2%, 99.1%, and 98.6% for precision, recall, and F1 measure, respectively), three main limitations of this work are acknowledged, which the authors plan to address as part of their ongoing/future research. First, the methodology was tested only on quantitative requirements. The types of semantic patterns and conflicts in other types of requirements (e.g., existential requirements) may vary and, thus, may lead to different performance results. Although the processing of other types of requirements is expected to be no more complex than that of quantitative requirements, and thus is expected to have similar or better performance, the authors plan to test the proposed methodology on other types of requirements (e.g., existential requirements) for validation. Second, due to the large amount of manual effort required in developing a gold standard, the proposed ITr algorithms were tested on only one chapter of IBC 2009. Similar high performance is expected when testing on other chapters of IBC and on other regulatory documents, since regulatory documents share similarities in expressions. However, different performance results might be obtained due to the possible variability of text across different chapters or different regulatory documents. As such, in future work, the authors plan to test the proposed ITr methodology on more chapters of IBC 2009 and on other types of regulatory documents (e.g., environmental regulations). Third, the validation of the proposed ITr algorithms focused on precision and recall. At this stage, the computational efficiency of the proposed algorithms was not evaluated, although it was taken into consideration when developing the algorithms. For example, the more efficient and stable merge sort (rather than quick sort) was used when a sorting algorithm was needed. In future work, the authors plan to perform algorithm optimization to improve the computational efficiency of the proposed algorithms, if/as necessary.
Contribution to the Body of Knowledge
This research contributes to the body of knowledge in four main ways. First, it offers domain-specific, semantic NLP-based information processing methods that can achieve full sentence processing and information extraction (i.e., all terms of a sentence are processed), as opposed to partial sentence processing and information extraction (i.e., only specific terms/concepts are processed/extracted). Domain-specific semantics allow for analyzing complex sentence structures that would otherwise be too complex and ambiguous for automated IE and ITr, recognizing domain-specific text meaning, and in turn allowing for processing/understandability of full sentences. Full sentence processing/understandability allows for a deeper level of NLP, namely natural language understanding. Second, this research shows that a hybrid approach that combines rule-based NLP methods and semantic NLP methods could achieve high performance for the combination of IE and ITr from/of regulatory text, in spite of the complexity inherent in natural language text. Domain-specific expert NLP knowledge (encoded in the form of rules), along with domain knowledge (represented in the form of an ontology), facilitates deep text processing/understandability. Previous work (Zhang and El-Gohary 2013c) showed high performance for rule-based, semantic IE. This paper further shows high performance for rule-based, semantic ITr. Third, it offers a new context-aware and flexible way of utilizing pattern-matching-rule-based methods through the use of context-aware semantic mapping rules. This way of utilizing pattern-matching-based rules captures the details (in terms of the expression, language structure, etc.) of complex sentence components in a context-aware manner and through flexible pattern lengths. Fourth, it offers a new mechanism (the “consume and generate” mechanism) for processing and transforming complex regulatory text into logic clauses. The proposed mechanism follows the bottom-up method, which, based on the experimental results, has been shown to outperform the top-down method in ITr. The high performance that the mechanism achieved verifies that the bottom-up method is suitable for such ITr tasks.
From a practical perspective, this work is expected to have significant impacts on four main levels. First, this work facilitates ACC in the construction domain. ACC could reduce the time, cost, and errors of the checking process; promote compliance of construction projects with various regulations (due to easier and more frequent checking); and encourage the adoption of BIM in the AEC industry. Second, the novel IE and ITr methods and algorithms proposed in this work could be adopted/applied to automate a variety of other tasks in the construction domain, such as contract document analysis and construction accident record analysis. Third, the proposed ITr methodology could be adopted/applied outside of the construction domain, which would contribute to the general domain of natural language processing/understanding. Fourth, the results of this research could ultimately lead to defining principles for drafting future regulations in a manner that supports ACC. For example, the use of uncommon expressions that tend to cause processing errors could be avoided when drafting future regulations.
Conclusions
This paper presented a rule-based, semantic NLP methodology for automated information transformation (ITr) of information instances, which were automatically extracted from construction regulatory documents, into logic clauses. A set of semantic mapping (SeM) rules and conflict resolution (CoR) rules are used in ITr. CoR rules resolve conflicts between information instances, while SeM rules transform the information instances into logic clause elements. The SeM rules use context-aware and flexible information patterns. Both syntactic and semantic information tags are utilized in the patterns. Syntactic information tags (e.g., POS tags) are generated using NLP techniques. A semantic model helps recognize the semantic information tags of each extracted information instance. A “consume and generate” mechanism is proposed to handle complex sentence components and execute the SeM rules. The ITr method, thus, processes almost all terms of a sentence. Such full sentence processing enables deep NLP towards natural language understanding.
The proposed ITr algorithms were tested in transforming information instances of quantitative requirements, which were automatically extracted from Chapter 19 of IBC 2009, into logic clauses. The transformation results were compared with a semi-automatically developed gold standard. The results showed 98.2%, 99.1%, and 98.6% precision, recall, and F1 measure, respectively. This high performance shows that the proposed ITr methodology is promising. Through error analysis, the following six causes of errors were recognized: (1) missing tags in IE; (2) incorrect tagging during IE; (3) errors in processing sentences with uncommon expression structures; (4) errors due to morphological analysis; (5) errors caused by certain SeM rules; and (6) structural ambiguity. In future work, the authors plan to further refine the proposed methodology to avoid those causes of errors as much as possible, in an effort to further enhance the performance of the ITr algorithms. Also, as part of the authors’ ongoing/future research, the proposed ITr methodology will be tested on more chapters of building codes and on other types of construction regulatory documents (e.g., environmental regulations). Similar high performance is expected. However, variability in performance is possible due to differences in the characteristics of the text across different chapters or documents.
Acknowledgement
The authors would like to thank the National Science Foundation (NSF). This material is based upon work supported by NSF under Grant No. 1201170. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF.
References
Abney, S. (1997). “Part-of-speech tagging and partial parsing.” Text, Speech and Language Technology, 2(1997), 118-136.
Avolve Software Corporation. (2013). Electronic plan review for building and planning departments. <http://www.avolvesoftware.com/index.php/solutions/building-departments/> (Oct. 3, 2013).
Breaux, T.D., and Anton, A.I. (2008). “Analyzing regulatory rules for privacy and security requirements.” IEEE Transactions on Software Eng., 34(1), 5-20.
Caldas, C.H., and Soibelman, L. (2003). “Automating hierarchical document classification for construction management information systems.” Autom. Constr., 12(2003), 395-406.
Califf, M. E., and Mooney, R. J. (2003). “Bottom-up relational learning of pattern matching rules for information extraction.” J. Machine Learning Research, 4(2003), 177-210.
Cherpas, C. (1992). “Natural language processing, pragmatics, and verbal behavior.” The Analysis of Verbal Behavior, 10, 135-147.
Crowston, K., Liu, X., Allen, E., and Heckman, R. (2010). “Machine learning and rule-based automated coding of qualitative data.” Proc., 73rd ASIS&T Annual Meeting: Navigating Streams in an Information Ecosystem, Association for Information Science and Technology, Silver Spring, Maryland, 1-2.
Ding, L., Drogemuller, R., Rosenman, M., Marchant, D., and Gero, J. (2006). “Automating code checking for building designs DesignCheck.” Clients Driving Innovation: Moving Ideas into Practice, CRC for Construction Innovation, Brisbane, Australia, 1-16.
El-Gohary, N.M., and El-Diraby, T.E. (2010). “Domain ontology for processes in infrastructure and construction.” J. Constr. Eng. Manage., 136(7), 730-744.
Fenves, S.J., Gaylord, E.H., and Goel, S.K. (1969). “Decision table formulation of the 1969 AISC specification.” Civ. Eng. Studies: Structural Research Series, 347, University of Illinois, Urbana, IL, 1-167.
Garrett, J.H., Jr., and Fenves, S.J. (1987). “A knowledge-based standard processor for structural component design.” Eng. with Comput., 2(4), 219-238.
Gildea, D., and Jurafsky, D. (2002). “Automatic labeling of semantic roles.” J. Comput. Linguist., 28(3), 245-288.
Goh, O. S., Depickere, A., Fung, C.C., and Wong, K. W. (2006). “Top-down natural language query approach for embodied conversational agent.” Proc., Intl. MultiConf. Eng. and Comput. Sci. (IMECS 2006), The International Association of Engineers (IAENG), Hong Kong, China.
Gruber, T.R. (1995). “Toward principles for the design of ontologies used for knowledge sharing.” Intl. J. Human-Computer Studies, 43, 907-928.
Han, C.S., Kunz, J.C., and Law, K.H. (1998). “Client/server framework for online building code checking.” J. Comput. Civ. Eng., 12(4), 181-194.
International Code Council (ICC). (2012). “International Code Council.” AEC3, <http://www.aec3.com/en/5/5_013_ICC.htm> (Oct. 26, 2013).
Khemlani, L. (2005). “CORENET e-PlanCheck: Singapore's automated code checking system.” AECbytes “Building the Future” Article, <http://www.aecbytes.com/buildingthefuture/2005/CORENETePlanCheck.html> (Oct. 26, 2013).
Kiyavitskaya, N., Zeni, N., Breaux, T.D., Anton, A.I., Cordy, J.R., Mich, L., and Mylopoulos, J. (2008). “Automating the extraction of rights and obligations for regulatory compliance.” Lecture Notes in Comput. Sci., 5231(2008), 154-168.
Marquez, L. (2000). “Machine learning and natural language processing.” Proc., “Aprendizaje automatico aplicado al procesamiento del lenguaje natural”.
Nguyen, T. (2005). “Integrating building code compliance checking into a 3D CAD system.” Proc., Intl. Conf. Comput. Civ. Eng., ASCE, Reston, VA, 1-12.
Niemeijer, R.A., Vries, B. de, and Beetz, J. (2009). “Check-mate: automatic constraint checking of IFC models.” Managing IT in Construction/Managing Construction for Tomorrow, A. Dikbas, E. Ergen, and H. Giritli (Eds.), CRC Press, London, 479-486.
Pocas Martins, J.P., and Abrantes, V. (2010). “Automated code-checking as a driver of BIM adoption.” Intl. J. Housing Sci., 34(4), 286-294.
Pradhan, S., Ward, W., Hacioglu, K., Martin, J.H., and Jurafsky, D. (2004). “Shallow semantic parsing using support vector machines.” Proc., NAACL-HLT, The Association for Computational Linguistics, East Stroudsburg, PA, 233-240.
Roth, D., and Yih, W. (2004). “A linear programming formulation for global inference in natural language tasks.” Proc., 2004 Conf. Comput. Natural Language Learning (CoNLL-2004), SIGNLL, Boston, MA, 1-8.
Saint-Dizier, P. (1994). “Advanced logic programming for language processing.” Academic Press, San Diego, CA.
Salama, D., and El-Gohary, N. (2013a). “Semantic text classification for supporting automated compliance checking in construction.” J. Comput. Civ. Eng., Accepted and published online ahead of print.
Salama, D., and El-Gohary, N. (2013b). “Automated compliance checking of construction operation plans using a deontology for the construction domain.” J. Comput. Civ. Eng., 27(6), 681-698.
Soysal, E., Cicekli, I., and Baykal, N. (2010). “Design and evaluation of an ontology based information extraction system for radiological reports.” Comput. in Biology and Med., 40(11-12), 900-911.
Sterling, L., and Shapiro, E. (1986). “The art of Prolog: advanced programming techniques.” MIT Press, Cambridge, Massachusetts, London, England.
Tan, X., Hammad, A., and Fazio, P. (2010). “Automated code compliance checking for building envelope design.” J. Comput. Civ. Eng., 24(2), 203-211.
Tierney, P.J. (2012). “A qualitative analysis framework using natural language processing and graph theory.” The Intl. Review of Research in Open and Distance Learning, 13(5).
University of Sheffield. (2013). “General architecture for text engineering.” <http://gate.ac.uk/> (Oct. 13, 2013).
Wyner, A., and Governatori, G. (2013). “A study on translating regulatory rules from natural language to defeasible logic.” Proc., RuleML 2013: The 7th Intl. Web Rule Symposium, Springer-Verlag, Berlin Heidelberg, Germany.
Wyner, A., and Peters, W. (2011). “On rule extraction from regulations.” Proc., JURIX 2011: The 24th Intl. Conf. Legal Knowledge and Info. Systems, IOS Press, Amsterdam, The Netherlands, 113-122.
Yin, S., and Fan, G. (2013). “Research of POS tagging rules mining algorithm.” Applied Mechanics and Materials, 347-350(2013), 2836-2840.
Zhang, J., and El-Gohary, N.M. (2013a). “Information transformation and automated reasoning for automated compliance checking in construction.” Proc., 2013 ASCE Intl. Workshop Comput. in Civ. Eng., ASCE, Reston, VA, 701-708.
Zhang, J., and El-Gohary, N.M. (2013b). “Handling sentence complexity in information extraction for automated compliance checking in construction.” Proc., CIB W78 2013, Conseil International du Bâtiment (CIB), Rotterdam, The Netherlands.
Zhang, J., and El-Gohary, N. (2013c). “Semantic NLP-based information extraction from construction regulatory documents for automated compliance checking.” J. Comput. Civ. Eng., Accepted and published online ahead of print.
Zhong, B.T., Ding, L.Y., Luo, H.B., Zhou, Y., Hu, Y.Z., and Hu, H.M. (2012). “Ontology-based semantic modeling of regulation constraint for automated construction quality compliance checking.” Autom. Constr., 28, 58-70.
Zhou, N. (2012). “B-Prolog user’s manual (version 7.7): Prolog, agent, and constraint programming.” Afany Software. <http://www.probp.com/manual/manual.html> (Nov. 19, 2012).
Zouaq, A. (2011). “An overview of shallow and deep natural language processing for ontology learning.” Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances, IGI Global, Hershey, PA, 16-38.
929
Tables
Table 1: A Transformation Example

Requirement sentence: Courts shall not be less than 3 feet in width.

| Source Information Tag | Source Information Instance |
|---|---|
| Subject | court |
| Compliance Checking Attribute | width |
| Comparative Relation | not less than |
| Quantity Unit | feet |
| Quantity Reference | NA |

Target logic clause:
compliant_width_of_court(Court) :- width(Width), court(Court), has(Court,Width), greater_than_or_equal(Width,quantity(3,feet)).
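
The mapping in Table 1 can be illustrated with a minimal Python sketch. It assumes the information instances are already extracted and that the quantity value 3 from the requirement sentence is available alongside them. The names `build_clause` and `COMPARATIVE_PREDICATES` are hypothetical and introduced here only for illustration; this fixed template is not the authors' implementation, which relies on SeM and CoR rules.

```python
# Hypothetical lookup from comparative-relation phrases to predicate names.
# Only the first entry comes from Table 1; the second is an assumed analogue.
COMPARATIVE_PREDICATES = {
    "not less than": "greater_than_or_equal",
    "not greater than": "less_than_or_equal",  # assumption, not from the paper
}


def build_clause(subject, attribute, relation, value, unit):
    """Assemble a Table 1 style logic clause as a string."""
    subj_var = subject.capitalize()   # upper case marks a variable
    attr_var = attribute.capitalize()
    predicate = COMPARATIVE_PREDICATES[relation]
    head = f"compliant_{attribute}_of_{subject}({subj_var})"
    body = [
        f"{attribute}({attr_var})",
        f"{subject}({subj_var})",
        f"has({subj_var},{attr_var})",
        f"{predicate}({attr_var},quantity({value},{unit}))",
    ]
    return f"{head} :- {', '.join(body)}."


clause = build_clause("court", "width", "not less than", 3, "feet")
print(clause)
```

Run on the Table 1 instances, the sketch prints the target logic clause shown in the table.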
Table 2: Patterns of the Most Applied SeM Rules in the Experiments

| SeM Rule Pattern | Action | Condition Case | Logic Clause Generated | SeM Rule Type |
|---|---|---|---|---|
| [‘a’ ‘s’ ‘cr’] (a) ‘OF’ (b) [‘a’ ‘s’ ‘cr’] (c) | | | a(A),c(C),has(C,A) | Simple |
| ‘dpvr’ (a) [‘s’ ‘cr’] (b) | look-back search for attribute or subject (s); look-back search for negation (n) | n exists | s(S),b(B),not a(S,B) | Complex |
| | | n not exists | s(S),b(B),a(S,B) | |
| ‘c’ (a) ‘v’ (b) | look-back search for attribute or subject (s); look-ahead search for unit or reference (u); look-back search for negation (n) | n exists | not a(S, quantity(b,u)) | Complex |
| | | n not exists | a(S, quantity(b,u)) | |
| ‘I’ ‘s’ | skip | | | Multiple action |
| ‘c’ (a) ‘v’ (b) ‘u’ (c) ‘IN’ (d) ‘s’ (e) | look-back search for attribute or subject (s) | | distance(Distance),s(S),e(E),d(S,E,Distance),a(Distance,quantity(b,c)) | Multiple action |
| [‘a’ ‘s’ ‘cr’] (a) ‘CC’ (b) [‘a’ ‘s’ ‘cr’] (c) | | | (a(A);c(A)) | Simple |
| [‘VB’ ^ ‘be’] (a) ‘IN’ (b) [‘cr’ ‘a’ ‘s’] (c) | look-back search for subject or attribute (s) | | s(S),c(C),b(S,C) | Multiple action |
| [‘a’ ‘s’ ‘cr’] (a) ‘IN’ (b) [‘a’ ‘s’ ‘cr’] (c) | | | a(A),c(C),b(A,C) | Simple |
| ‘Except’ | mark the beginning of exception | | | Multiple action |
| ‘n’ (a) ‘c’ (b) ‘v’ (c) ‘u’ (d) | look-back search for attribute or subject (s) | | s(S),not b(S,quantity(c,d)) | Multiple action |
| [‘a’ ‘s’] (a) ‘OF’ (b) ‘v’ (c) [‘u’ ‘a’] (d) | | pattern preceded by [‘a’ ‘s’ ‘cr’] (e) [‘Has’ ‘NoHas’ ‘IN’ ‘OF’ ^ ‘between’] (f) | a(A),e(E),equal_to(E,quantity(c,d)) | Multiple condition |
| | | otherwise | a(A),equal_to(A,quantity(c,d)) | |
| ‘VBP’ (a) ‘VBN’ (b) | look-back search for attribute or subject (s) | | b(S) | Multiple action |
| ‘I’ ‘CC’ | skip | | | Multiple action |
| ‘s’ (a) ‘MD’ (b) ‘Has’ (c) ‘a’ (d) | look-back search for attribute or subject (s) | pattern preceded by ‘IN’ | s(S),d(D),has(S,D) | Complex |
| | | otherwise | a(A),d(D),has(A,D) | |
| ‘TO’ (a) ‘VB’ (b) [‘s’ ‘cr’ ‘a’] (c) | look-back search for attribute or subject (s) | s not exists | c(C),a_b(C) | Complex |

(1) ‘’: A pair of single quotes encloses an information tag.
(2) ^: A caret separates optional information tags from exceptions.
(3) (a), (b), (c), etc., show the mapping of components (in SeM patterns) to logic clause elements (in generated logic clauses), where an upper case letter represents a variable.
(4) Contents in the “Logic Clause Generated” column are case-sensitive.
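
To make the look-back and look-ahead actions in Table 2 concrete, the following is a minimal Python sketch of the ‘c’ (a) ‘v’ (b) rule only. It assumes that ‘c’ and ‘v’ denote the comparative relation and the quantity value, that searches scan the whole sentence, and that predicate names are formed by replacing spaces with underscores. The token list, the helper names, and the omission of the quantity-reference branch are all illustrative assumptions, not the authors' algorithm.

```python
from typing import List, Optional, Set, Tuple

Token = Tuple[str, str]  # (text, information tag)


def look_back(tokens: List[Token], start: int, tags: Set[str]) -> Optional[str]:
    """Return the text of the nearest token before `start` whose tag is in `tags`."""
    for i in range(start - 1, -1, -1):
        if tokens[i][1] in tags:
            return tokens[i][0]
    return None


def look_ahead(tokens: List[Token], start: int, tags: Set[str]) -> Optional[str]:
    """Return the text of the nearest token after `start` whose tag is in `tags`."""
    for i in range(start + 1, len(tokens)):
        if tokens[i][1] in tags:
            return tokens[i][0]
    return None


def apply_c_v_rule(tokens: List[Token]) -> Optional[str]:
    """Apply the 'c' (a) 'v' (b) pattern at the first position where it matches."""
    for i in range(len(tokens) - 1):
        (a_text, a_tag), (b_text, b_tag) = tokens[i], tokens[i + 1]
        if (a_tag, b_tag) != ("c", "v"):
            continue
        subject = look_back(tokens, i, {"a", "s"})   # attribute or subject (s)
        unit = look_ahead(tokens, i + 1, {"u"})      # unit only; reference case omitted
        if subject is None or unit is None:
            return None
        negated = look_back(tokens, i, {"n"}) is not None
        predicate = a_text.replace(" ", "_")         # assumed naming convention
        variable = subject.capitalize()              # upper case marks a variable
        element = f"{predicate}({variable}, quantity({b_text},{unit}))"
        return f"not {element}" if negated else element
    return None


# Hypothetical tagging of the Table 1 sentence.
tagged = [("court", "s"), ("shall", "MD"), ("not", "n"), ("be", "VB"),
          ("less than", "c"), ("3", "v"), ("feet", "u"), ("in", "IN"), ("width", "a")]
print(apply_c_v_rule(tagged))  # not less_than(Court, quantity(3,feet))
```

The output is the clause element for this single pattern with the "n exists" condition; it is not the complete Table 1 clause, which results from applying the full rule set.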
Table 3: Patterns of the Rest of the SeM Rules Applied in the Experiments
SeM Rule Pattern
[‘a’ ‘s’ ‘cr’] ‘MD’ ‘n’ ‘VB’ ‘c’ ‘v’ ‘u’
‘VBP’ ‘dpvr’ ‘VB’
‘s’ ‘JJ’ ‘n’ ‘c’ ‘v’ ‘u’
‘n’ ‘c’ ‘s’
‘IN’ ‘ea’ [‘v’ ‘CD’] ‘u’ ‘OF’ ‘s’
[‘s’ ‘cr’] ‘VBD’ [‘cr’ ‘s’]
‘I’ ‘CC’ ‘n’ ‘C’ ‘v’ ‘u’
‘IN’ ‘VBG’ [‘cr’ ‘s’]
‘JJ’ ‘IN’ ‘c’ ‘v’ [‘u’ ‘cr’]
[‘s’ ‘cr’] ‘VBP’ [‘VBN’ ‘JJ’]
‘VB’ ‘IN’ ‘c’ ‘v’ [‘cr’ ‘s’]
‘dpvr’ ‘v’ ‘u’
‘s’ ‘MD’ ‘VB’ ‘dpvr’ [‘VBZ’ ‘cr’ ‘VB’]
‘RB’ ‘TO’ [‘s’ ‘cr’]
‘CC’ ‘v’ ‘u’ ‘IN’ ‘a’
‘MD’ ‘VB’ ‘VBN’
‘TO’ [‘s’ ‘cr’]
‘a’ ‘OF’ ‘v’ ‘u’ ‘by’ ‘v’ ‘u’
‘s’ ‘MD’ ‘n’ ‘VB’ ‘dpvr’
[‘cr’ ‘s’ ‘a’] [‘OF’ ‘IN’ ‘Has’ ‘NoHas’ ^ ‘for’] ‘s’ ‘IN’ ‘s’
[‘s’ ‘a’ ‘cr’] ‘I’ ‘VBG’ [‘cr’ ‘a’ ‘s’] ‘I’
‘MD’ ‘VB’ [‘a’ ‘s’ ‘cr’]
‘JJ’ ‘CC’ ‘JJR’ ‘s’
‘n’ ‘c’ ‘v’
‘s’ ‘WDT’ ‘VBP’ ‘cr’
‘n’ ‘c’ ‘CD’
‘VBG’ ‘cr’ ‘VBP’ ‘VBN’
‘v’ [‘s’ ‘cr’]
‘MD’ ‘VB’ ‘v’ ‘u’
‘s’ ‘VBN’
‘c’ ‘v’ ‘ea’ [‘cr’ ‘s’]
‘JJR’ ‘IN’
‘IN’ ‘JJ’ ‘CC’ ‘s’
‘TO’ [‘s’ ‘cs’]
[‘s’ ‘cr’] ‘with’ ‘a’
‘Except’ ‘IN’
‘n’ ‘c’ ‘v’ [‘cr’ ‘s’]
‘rv’ [‘a’]
‘JJR’ ‘IN’ ‘v’ ‘u’
‘VBZ’ ‘dpvr’
‘s’ ‘Has’ ‘a’ ‘OF’ ‘c’ ‘v’ ‘u’
‘VB’ [‘cr’ ‘a’ ‘s’]
‘s’ ‘MD’ ‘VB’ ‘OF’
‘IN’ [‘cr’ ‘a’ ‘s’]
‘MD’ ‘VB’ ‘dpvr’ ‘s’
[‘u’ ‘JJR’] [^ ‘stories’]
[‘cr’ ‘a’ ‘s’] ‘MD’ ‘VB’ [‘cr’ ‘a’ ‘s’]
‘I’ ‘a’
‘s’ ‘MD’ ‘Has’ ‘s’
‘I’ ‘VBD’
‘cs’ ‘MD’ ‘Has’ ‘s’
‘I’ ‘JJ’
‘v’ ‘u’ ‘CC’ ‘JJR’
‘VBD’ ‘I’
‘s’ ‘MD’ ‘VB’ ‘dpvr’
Table 4: Experimental Results Using Essential Information Tags Only

| | Concepts | Relations | Total |
|---|---|---|---|
| Number of logic clause elements in gold standard | 334 | 749 | 1083 |
| Total number of logic clause elements generated | 328 | 786 | 1114 |
| Number of logic clause elements correctly generated | 324 | 706 | 1030 |
| Precision | 0.988 | 0.898 | 0.925 |
| Recall | 0.970 | 0.943 | 0.951 |
| F1 measure | 0.979 | 0.920 | 0.938 |
Table 5: Experimental Results Using Both Essential and Secondary Information Tags

| | Concepts | Relations | Total |
|---|---|---|---|
| Number of logic clause elements in gold standard | 570 | 1349 | 1919 |
| Total number of logic clause elements generated | 569 | 1367 | 1936 |
| Number of logic clause elements correctly generated | 568 | 1333 | 1901 |
| Precision | 0.998 | 0.975 | 0.982 |
| Recall | 0.996 | 0.988 | 0.991 |
| F1 measure | 0.997 | 0.982 | 0.986 |
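
The precision, recall, and F1 values in Tables 4 and 5 can be recomputed from the element counts. The sketch below assumes the standard definitions (precision = correct / generated, recall = correct / gold standard, F1 = harmonic mean of the two) and rounding to three decimals; the function name `precision_recall_f1` and the dictionary layout are hypothetical.

```python
def precision_recall_f1(gold, generated, correct):
    """Return (precision, recall, F1), each rounded to three decimals."""
    precision = correct / generated
    recall = correct / gold
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 3), round(recall, 3), round(f1, 3)


# (gold standard, generated, correctly generated) counts copied from the tables.
TABLE_4 = {"Concepts": (334, 328, 324), "Relations": (749, 786, 706), "Total": (1083, 1114, 1030)}
TABLE_5 = {"Concepts": (570, 569, 568), "Relations": (1349, 1367, 1333), "Total": (1919, 1936, 1901)}

for name, table in (("Table 4", TABLE_4), ("Table 5", TABLE_5)):
    for column, counts in table.items():
        print(name, column, precision_recall_f1(*counts))
```

Under these assumptions the recomputed values agree with every precision, recall, and F1 entry reported in both tables.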
Table 6: Comparative Summary of Experiment #1 and Experiment #2

| | Experiment #1 | Experiment #2 | Increase |
|---|---|---|---|
| Number of information tags used | 8 | 40 | +400% |
| Number of semantic mapping rules used | 53 | 297 | +460% |
| Number of conflict resolution rules used | 11 | 9 | -18% |
| Number of logic clause elements built | 1114 | 1936 | +74% |
| Precision | 0.925 | 0.982 | +6% |
| Recall | 0.951 | 0.991 | +4% |
| F1 measure | 0.938 | 0.986 | +5% |
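
The Increase column of Table 6 is consistent with relative change, (Experiment #2 - Experiment #1) / Experiment #1, rounded to the nearest percent. The following minimal Python sketch recomputes it from the table values; the helper name `relative_change_percent` is hypothetical. For the logic clause elements row, the relative change is about 74%, whereas the ratio 1936/1114 is about 174%.

```python
def relative_change_percent(before, after):
    """Relative change from `before` to `after`, as a whole-number percentage."""
    return round((after - before) / before * 100)


# (Experiment #1, Experiment #2) values copied from Table 6.
ROWS = {
    "information tags": (8, 40),
    "semantic mapping rules": (53, 297),
    "conflict resolution rules": (11, 9),
    "logic clause elements": (1114, 1936),
    "precision": (0.925, 0.982),
    "recall": (0.951, 0.991),
    "F1 measure": (0.938, 0.986),
}

changes = {name: relative_change_percent(*pair) for name, pair in ROWS.items()}
for name, change in changes.items():
    print(f"{name}: {change:+d}%")
```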
Figures
Figure 1. Proposed approach for automated rule extraction
Figure 2. Proposed information transformation methodology
Figure 3. Sample sentence with information tags
Figure 4. Illustration of top-down method and bottom-up method
Figure 5. Example illustrating logic-based information representation and reasoning
Figure 6. Semantic information tags
Figure 7. Example illustrating the processing of a sample sentence: (a) original sentence; (b) sentence tagged with information tags; (c) information instance tuple list; (d) information instance tuple list after applying conflict resolution rules; (e) logic clause generated by consume and generate mechanism
Figure 8. Pseudocode for main algorithm
Figure 9. Pseudocode for consume and generate mechanism