
Automated Information Transformation for Automated Regulatory Compliance Checking in Construction

August 2015
Journal of Computing in Civil Engineering 29(4):B4015001


DOI:10.1061/(ASCE)CP.1943-5487.0000427

Authors:

Purdue University

To fully automate regulatory compliance checking of construction projects, regulatory requirements need to be automatically extracted from various construction regulatory documents and then transformed into a formalized format that enables automated reasoning. To address this need, the authors propose an approach for automatically extracting information from construction regulatory textual documents and transforming them into logic clauses that could be directly used for automated reasoning. This paper focuses on presenting the proposed information transformation (ITr) methodology and the corresponding algorithms. The proposed ITr methodology utilizes a rule-based, semantic natural language processing (NLP) approach. A set of semantic mapping (SeM) rules and conflict resolution (CoR) rules are used to enable the automation of the transformation process. Several syntactic text features (captured using NLP techniques) and semantic text features (captured using an ontology) are used in the SeM and CoR rules. A bottom-up method is leveraged to handle complex sentence components. A consume and generate mechanism is proposed to implement the bottom-up method and execute the SeM rules. The proposed ITr algorithms were tested in transforming information instances of quantitative requirements, which were automatically extracted from the International Building Code 2009, into logic clauses. The algorithms achieved 98.2 and 99.1% precision and recall, respectively, on the testing data.


1 Graduate Student, Dept. of Civil and Environmental Engineering, Univ. of Illinois at Urbana-Champaign, 205 N. Mathews Ave., Urbana, IL 61801.

2 Assistant Professor, Dept. of Civil and Environmental Engineering, Univ. of Illinois at Urbana-Champaign, 205 N. Mathews Ave., Urbana, IL 61801 (corresponding author). E-mail: gohary@illinois.edu; Tel: +1-217-333-6620; Fax: +1-217-265-8039.

Automated Information Transformation for Automated Regulatory Compliance Checking

in Construction

Jiansong Zhang1; and Nora M. El-Gohary, A.M.ASCE2


The published version is found in the ASCE Library here: http://ascelibrary.org/doi/abs/10.1061/(ASCE)CP.1943-5487.0000427

Zhang, J. and El-Gohary, N. (2015). "Automated Information Transformation for Automated Regulatory Compliance Checking in

Construction." J. Comput. Civ. Eng., 10.1061/(ASCE)CP.1943-5487.0000427, B4015001.

CE Database subject headings: Project management; Construction management; Information

management; Computer applications; Artificial intelligence.

Author keywords: Automated compliance checking; Automated information extraction;

Automated information transformation; Natural language processing; Semantic systems;

Automated construction management systems.

Introduction

Construction projects must comply with a host of regulations. The manual process of compliance

checking is, thus, time-consuming, costly, and error-prone (Han et al. 1998; Nguyen 2005;

Zhang and El-Gohary 2013c). Automated compliance checking (ACC), as an alternative to

manual checking, is expected to reduce the time, cost, and errors of compliance checking (CC)

(Tan et al. 2010; Salama and El-Gohary 2013b). In addition, ACC has many other potential

benefits, such as: (1) allowing earlier identification of potential non-compliance instances, which

could save significant time and cost caused by design modification and/or rework (Ding et al.

2006); (2) promoting the adoption of building information modeling (BIM) and increasing the

cumulative benefits of adopting BIM, since BIM would enable ACC (Pocas Martins and

Abrantes 2010); (3) enabling more efficient incorporation of stakeholder input into project

design and exploration of what-if design scenarios, since a designer would be better able to

experiment with different design options and check their compliance in a more time-efficient

manner (Niemeijer et al. 2009); and (4) reducing violations of regulations due to easier and more

frequent CC (Zhong et al. 2012).

Due to the many anticipated benefits of ACC, many efforts were undertaken in the area of ACC

in construction. The start of these efforts could be dated back to the 1960s, when Fenves et al.


(1969) formalized the American Institute of Steel Construction (AISC) specifications into

decision tables. These efforts took various approaches to ACC and focused on various ACC

purposes (or subdomains). For example, Garrett and Fenves (1987) proposed a strategy to

represent design standards using information networks and represent design component

properties using data items for ACC of structural designs; Ding et al. (2006) proposed an

approach to represent building codes using object-based rules and represent designs using an

Industry Foundation Classes (IFC)-based internal model for ACC of accessibility regulations;

Tan et al. (2010) proposed an approach to represent building codes and design regulations using

decision tables and incorporate simulation results in building information models for ACC of

building envelope design; the CORENET (Construction and Real Estate NETwork) project of

Singapore (Khemlani 2005) used an approach to represent design information using semantic

objects in the FORNAX library (i.e., a C++ library) and represent regulatory rules using

properties and functions in FORNAX objects for ACC of building control regulations, barrier

free access, and fire code, etc.; and the SMARTcodes project (ICC 2012) of the International

Code Council (ICC) used an approach to represent ICC codes in computer-processable tuple

format and represent designs using an IFC-based model for ACC of designs with ICC codes.

These efforts have all been very important in supporting ACC, and have shown the possibilities

of ACC through different system designs and implementations. However, despite their

importance, these efforts are limited in their automation capability; existing ACC efforts/systems

still require manual effort for the extraction of regulatory requirements from regulatory

documents and encoding them in a computer-processable format (Zhong et al. 2012; Zhang and

El-Gohary 2013c). To achieve full automation of ACC, this extraction and encoding process

needs to be fully automated.


To address this gap, the authors are proposing a new approach for automated rule extraction and

formalization for supporting ACC (Zhang and El-Gohary 2013a; Zhang and El-Gohary 2013b).

The approach utilizes semantic modeling and semantic Natural Language Processing (NLP)

techniques (for both information extraction and information transformation) to facilitate

automated textual regulatory document analysis (e.g., code analysis) and processing for

extracting requirements/rules from these documents and formalizing these requirements/rules in

a meaning-rich, computer-processable format. The approach involves developing a set of

algorithms and combining them into one computational platform: (1) machine-learning-based

algorithms for text classification (TC), (2) rule-based, semantic NLP algorithms for information

extraction (IE), and (3) rule-based, semantic NLP algorithms for information transformation

(ITr). This paper focuses on presenting the methodology and algorithms for ITr.

Proposed Approach for Automated Rule Extraction and Formalization for Automated

Compliance Checking

Proposed Approach

A five-phase, iterative approach for automatically extracting regulatory requirements/rules from

textual regulatory documents and formalizing these requirements in a logic format for further

automated reasoning is proposed (Figure 1). The five phases are: text classification (TC),

information extraction (IE), information transformation (ITr), implementation, and evaluation.

TC, IE, and ITr are the main processing phases.

Insert Figure 1

TC recognizes relevant sentences in a regulatory text corpus. Relevant sentences are the

sentences that contain the types of requirements that are relevant for an ACC scenario (e.g.,

environmental requirements in the scenario of environmental CC). Target information in those

relevant sentences is extracted and transformed in the later IE and ITr processes. The TC process,

thus, filters out irrelevant sentences, thereby saving unnecessary processing of irrelevant

sentences. Such filtering also avoids unnecessary extraction and transformation errors that may

be caused by the processing of irrelevant sentences. The presentation of the TC algorithms and

results is outside the scope of this paper. For further details on the authors’ work in TC, the

reader is referred to Salama and El-Gohary (2013a).

IE recognizes the words and phrases in the relevant sentences that carry target information, extracts information from these words/phrases, and labels them with pre-defined information tags. An information tag is a symbol/name indicating a certain type of meaning. For example, the information tag ‘subject’ carries the semantic meaning that the information instance is a “thing” (e.g., building object) that is subject to a particular regulation or norm; while the information tag ‘JJ’ carries the syntactic meaning that the information instance is an adjective that describes a noun as a modifier. Target information is the information needed to check a specific type of regulatory requirement. For example, for quantitative requirements, the quantified values/measurements of specific properties/attributes are target information. For IE by itself, a seven-phase, iterative methodology is utilized. In the IE methodology, a set of pattern-matching-based IE rules are used. Both syntactic (i.e., related to syntax and grammar, such as part-of-speech (POS) tags) and semantic (i.e., related to context and meaning, such as ontology concepts and relations) text features are used in the IE rules. The presentation of the IE algorithms and results is outside the scope of this paper. For further details on the authors’ work in the area of IE, the reader is referred to Zhang and El-Gohary (2013c).

ITr takes the extracted information instances and transforms them into logic clauses (i.e., logic statements that can be further used in logic programs) using a set of pattern-matching-based rules. Two types of rules are utilized for ITr: semantic mapping (SeM) rules and conflict resolution (CoR) rules. Several syntactic and semantic text features are used in the rules. A bottom-up method is utilized to handle complex sentence components. A “consume and generate” mechanism is proposed to implement the bottom-up method and execute the SeM rules. The following sections present and discuss the proposed ITr methodology in more detail. The experimental implementation of the methodology in processing quantitative requirements from Chapter 19 of the International Building Code (IBC) 2009 is also presented.

Comparison to the State-of-the-Art

In recent years, a number of research efforts, in domains such as software engineering (Breaux and Anton 2008; Kiyavitskaya et al. 2008) and legal compliance (Wyner and Peters 2011), have been studying the extraction of regulatory rules from textual documents. Most of these efforts (1) require manual annotation or mark-up of textual documents; and (2) aim at processing text at a coarser granularity level, i.e., process text into text segments rather than term-level concepts/relations. On the other hand, the proposed approach (1) does not require manual annotation or mark-up of textual documents; and (2) aims at processing text into concepts and relations at the term level (i.e., aims at performing a deeper level of NLP). To the best of the authors’ knowledge, the only work that has taken a somewhat similar approach to the proposed one (in that it also does not require manual annotation/mark-up and aims at term-level processing, in addition to utilizing a semantic and logic-based approach) is that by Wyner and Governatori (2013). Wyner and Governatori (2013) have conceptually explored and analyzed the use of semantic parsing and defeasible logic for regulatory rule representation. In comparison, the proposed approach (1) utilizes both syntactic and semantic text features in an integrated way rather than utilizing only semantic information: the use of syntactic text features in addition to semantic ones allows for handling more complex expressions; (2) uses a domain ontology for capturing domain-specific semantic information rather than using generic semantic information produced through generic semantic parsing: capturing and using semantic text features based on domain-specific meaning allows for unambiguous interpretation of concepts/relations/terms (e.g., “bridge” as an infrastructure instead of the card game) and identification of implicit semantic relations (e.g., “fly ash” is a type of “cementitious material”); (3) uses first order logic (FOL) rather than defeasible logic: FOL is the most widely used logic in automated reasoning and has been extensively verified for expressivity and simplicity; and (4) has advanced to the stages of implementation, testing, and evaluation: this allows for assessing the validity of the proposed approach using measures of precision and recall.

Background


Natural Language Processing (NLP)

NLP is a subfield of artificial intelligence (AI) that aims at making natural language text or speech computer-understandable, so that the text or speech could be processed by computers in a human-like manner (Cherpas 1992). Examples of NLP-enabled applications include automated natural language translation and automated text summarization (Marquez 2000). Examples of NLP subtasks include tokenization, POS tagging, semantic role labeling (Gildea and Jurafsky 2002), and named entity recognition (Roth and Yih 2004). NLP tasks may take two main approaches: a machine learning (ML)-based approach or a rule-based approach. An ML-based approach utilizes ML algorithms for text processing (e.g., Pradhan et al. 2004), whereas a rule-based approach utilizes manually-coded rules (e.g., Soysal et al. 2010). Rule-based methods require more human effort for rule development, but tend to show better text processing performance (Crowston et al. 2010). From another viewpoint, NLP approaches could be either shallow or deep. Shallow NLP conducts partial analysis of a sentence or extracts partial, specific information from a sentence (e.g., entities or concepts). Deep NLP aims at full sentence analysis towards capturing the entire meaning of a sentence (Zouaq 2011). The state-of-the-art in NLP has achieved reasonable performances for shallow NLP tasks, whereas it is still being challenged by deep NLP tasks. Deep NLP requires elaborate knowledge representation and reasoning, which remains a challenge for AI (Tierney 2012).

In the construction domain, there have been a number of important research efforts that have utilized NLP techniques. For example, Caldas and Soibelman (2003) have conducted ML-based text classification of construction documents. For an overview of some of these efforts, the reader is referred to Zhang and El-Gohary (2013c).

Rule-Based NLP using Pattern-Matching-Based Rules

Pattern-matching-based rules are widely used in NLP tasks such as POS tagging (Abney 1997; Yin and Fan 2013), information extraction (Califf and Mooney 2003), and text understanding (Goh et al. 2006). The idea of pattern-matching-based rules is to define a set of results that are produced when a pattern, i.e., a specific sequence (or structure, like a tree) of elements (e.g., characters, tokens, symbols, terms, concepts), is matched. Pattern-matching-based rules have a variety of implementations tailored to different purposes and domains, but they all share the same rule schema of “if pattern then result” or the mapping of “from pattern to result”. For example, in the proposed SeM rules, the result is the transformation of information instances into logic clause elements; while in the proposed CoR rules, the result is the deletion or conversion of certain information instances and/or their information tags to resolve conflicts.
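The shared “if pattern then result” schema can be illustrated with a minimal sketch. The names used below (`Rule`, `apply_first_match`) and the toy tagged input are hypothetical and are not taken from the authors' implementation; the sketch only shows a pattern over a sequence of tags being mapped to a result.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Rule:
    """Hypothetical 'if pattern then result' rule over a sequence of tags."""
    pattern: Tuple[str, ...]                  # sequence of tags to match
    result: Callable[[Sequence[str]], str]    # action applied to the matched words


def apply_first_match(rules: List[Rule], tagged: List[Tuple[str, str]]) -> Optional[str]:
    """Return the result of the first rule whose pattern occurs in the tag sequence."""
    tags = [tag for _, tag in tagged]
    for rule in rules:
        n = len(rule.pattern)
        for start in range(len(tags) - n + 1):
            if tuple(tags[start:start + n]) == rule.pattern:
                words = [word for word, _ in tagged[start:start + n]]
                return rule.result(words)
    return None  # no rule matched


# Toy input of (word, tag) pairs; the sentence fragment is made up for illustration.
tagged = [("openings", "NNS"), ("located", "VBN"), ("in", "IN"), ("walls", "NNS")]
rules = [Rule(pattern=("VBN", "IN"), result=lambda ws: "_".join(ws))]
relation = apply_first_match(rules, tagged)
print(relation)
```

Here the pattern is a flat tag sequence; a tree-structured pattern would need a different matcher.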

Semantic Modeling and Semantic NLP

A semantic model aims at capturing the meanings of a domain or topic, usually in a structured manner. Ontology is a widely-used type of semantic model; it is defined as “an explicit specification of a conceptualization” (Gruber 1995). An ontology is, commonly, composed of concept hierarchies, relationships between concepts, and axioms. The axioms are used together with the concepts and relationships to define the semantic meaning of the conceptualization. An ontology is easily reusable and extendable (El-Gohary and El-Diraby 2010). The use of a semantic model could help in NLP tasks. For example, semantic-based IE has been shown to achieve better performance than syntactic-only IE (Soysal et al. 2010; Zhang and El-Gohary 2013c).

Logic-Based Information Representation and Reasoning

There are several types of formally-defined logic with varying degrees of descriptive capabilities (propositional logic, predicate logic, modal logic, description logic, etc.). Among the different types, FOL is the most widely-used for logic-based inference-making. A Horn Clause (HC) is one of the most restricted forms of FOL. Inference-making in FOL is most efficient using HC logic clauses, because of such restricted form (Saint-Dizier 1994). A HC is composed of a disjunction of literals of which at most one is positive. All HCs can be represented as rules that have one or more antecedents (i.e., left-hand sides (LHSs)) that are conjoined (i.e., combined using ‘and’ operator), and a single positive consequent (i.e., right-hand side (RHS)). For example, “compliant(T) :- thickness(T) , exterior_basement_wall(W) , has(W,T) , greater_than_or_equal(T, quantity(7 1/2, inches))” is a HC; where “,” is the conjunctive operator (i.e., “A , B” means “A and B”) and “:-” is the implication operator (i.e., “B :- A” means “A implies B”). There are three types of HCs: (1) one or more antecedents and one consequent, (2) zero antecedents and one consequent, and (3) one or more antecedents and zero consequents. Inference-making using HCs could be automatically and efficiently conducted, which makes it suitable for supporting automated reasoning for ACC.
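To make the inference step concrete, the following is a minimal sketch of forward chaining over ground (variable-free) Horn clauses. It is an illustration only, not the reasoner used by the authors: the function name `forward_chain`, the constants `w1` and `t1`, and the decimal threshold are assumptions. Only the first two HC types (rules and facts) are covered; goal clauses with zero consequents are not modeled.

```python
from typing import FrozenSet, List, Set, Tuple

# A Horn clause as (head, body): the body atoms are conjoined antecedents and
# the head is the single consequent. A fact is a clause with an empty body.
Clause = Tuple[str, FrozenSet[str]]


def forward_chain(clauses: List[Clause]) -> Set[str]:
    """Derive every atom that follows from the clauses (ground atoms only)."""
    derived: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            # A clause fires once all of its antecedents have been derived.
            if head not in derived and body <= derived:
                derived.add(head)
                changed = True
    return derived


clauses: List[Clause] = [
    # Facts (zero antecedents, one consequent), grounded with made-up constants.
    ("exterior_basement_wall(w1)", frozenset()),
    ("thickness(t1)", frozenset()),
    ("has(w1,t1)", frozenset()),
    # The threshold is written as 7.5 inches; this decimal form is an assumption.
    ("greater_than_or_equal(t1,quantity(7.5,inches))", frozenset()),
    # Rule (one or more antecedents, one consequent).
    ("compliant(t1)", frozenset({
        "thickness(t1)",
        "exterior_basement_wall(w1)",
        "has(w1,t1)",
        "greater_than_or_equal(t1,quantity(7.5,inches))",
    })),
]

derived = forward_chain(clauses)
print("compliant(t1)" in derived)
```

A logic programming engine such as Prolog would instead resolve clauses with variables through unification; grounding the atoms here keeps the sketch short.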

Proposed Information Transformation Methodology

The proposed ITr takes a rule-based, semantic NLP approach. It utilizes pattern-matching-based rules to automatically generate logic clauses based on the extracted information instances and their associated patterns of information tags. Both syntactic information tags (i.e., tags that label syntactic text features, e.g., ‘adjective’ is represented using the POS tag ‘JJ’) and semantic information tags (i.e., tags that label semantic text features, e.g., ‘compliance checking attribute’ is represented using the semantic tag ‘a’) are used in defining the patterns. A number of NLP techniques (e.g., POS tagging, term matching) are used to identify the syntactic information tags of each extracted information instance, and a semantic model (an ontology that represents domain knowledge) is used to identify the semantic information tags. The tagged information instances are transformed into HC-type logic clauses using a set of SeM rules and CoR rules. SeM rules define how to process the extracted information instances, based on their associated types of information tags and the context of the information tags, so that the extracted information instances could be transformed correctly into logic clauses. CoR rules resolve potential conflicts that may exist in the processing of different information tags. A bottom-up method is utilized to handle complex sentence components. A “consume and generate” mechanism is proposed to implement the bottom-up method and execute the SeM rules.

The following subsections present the proposed ITr methodology (Figure 2) in more detail.


Insert Figure 2


The Source: Extracted Information Instances

The information source for the ITr process is the set of input information instances that were obtained from the preceding IE process. Information instances have been labeled with information tags during IE. The implemented changes/improvements on the authors’ IE work since Zhang and El-Gohary (2013c) are: (1) in addition to semantic information tags, syntactic information tags and combinatorial information tags are also generated for further use in ITr; and (2) instead of the top-down method for handling complex sentence components (processing larger chunks of text first, then breaking them down to process smaller chunks of text), a bottom-up method (processing smaller chunks of text first, then aggregating them to process larger chunks of text) is adopted because, in the experiments, it was shown to achieve better performance in handling complex sentence components (Zhang and El-Gohary 2013b). As such, in the ITr process, the following three types of information tags (information tags will be shown using single quotes hereafter) are defined and used: (1) semantic information tags, (2) syntactic information tags, and (3) combinatorial information tags.

Semantic information tags are information tags that are related to the meaning and context of the labeled information instances. Instances of semantic information tags are recognized based on the concepts and relations in the domain ontology. For example, in the developed ontology, both “transverse reinforcement” and “vertical reinforcement” are subconcepts of the concept ‘subject’. Therefore, the appearances of “transverse reinforcement” (or “transverse reinforcements”) and “vertical reinforcement” (or “vertical reinforcements”) in Chapter 19 of IBC 2009 will be extracted as instances of the semantic information tag ‘subject’. The decision on which concepts and relations are essential to extract and transform is based on the type of requirement (e.g., quantitative requirements) that is being checked. For instance, ‘subject’ is a semantic information tag that is essential in the context of compliance checking of quantitative requirements.
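As a rough illustration of how a semantic information tag could be recognized from a concept hierarchy, the sketch below looks a term up in a toy ontology fragment. The dictionary contains only the two subconcepts named above; the function names and the naive plural handling are assumptions for illustration and do not describe the authors' ontology or matching method.

```python
from typing import Dict, Optional

# Toy fragment of a concept hierarchy: child concept -> parent concept.
# Only the two subconcepts named in the text are included.
PARENT: Dict[str, str] = {
    "transverse reinforcement": "subject",
    "vertical reinforcement": "subject",
}

# Concepts that serve directly as semantic information tags in this sketch.
SEMANTIC_TAGS = {"subject"}


def normalize(term: str) -> str:
    """Lowercase the term and strip a trailing plural 's' if that yields a known concept."""
    term = term.lower().strip()
    return term[:-1] if term.endswith("s") and term[:-1] in PARENT else term


def semantic_tag(term: str) -> Optional[str]:
    """Walk up the hierarchy until a concept that serves as a semantic tag is reached."""
    concept = normalize(term)
    seen = set()
    while concept not in SEMANTIC_TAGS:
        if concept not in PARENT or concept in seen:
            return None  # unknown term, or a cycle in the toy hierarchy
        seen.add(concept)
        concept = PARENT[concept]
    return concept


print(semantic_tag("Transverse reinforcements"))
print(semantic_tag("crawl space"))
```

Terms that are absent from the toy hierarchy return `None`, which here simply means that no semantic tag was assigned.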

Syntactic information tags are information tags that are related to the grammatical role of the labeled information instances. Instances of syntactic information tags are recognized based on their syntactic features. Syntactic information tags carry information that is more general than that carried by semantic information tags. For example, the syntactic information tag ‘noun’ describes the labeled information instance as a noun, while semantically the noun could belong to a ‘subject’, a ‘compliance checking attribute’, or another semantic information tag. In the proposed methodology, POS tags are mainly used as the syntactic features for syntactic information tags. For example, ‘JJ’ is the POS tag for adjective. It is a syntactic information tag for an information instance that describes properties/attributes of a noun. For example, the adjective “habitable” in “habitable room” is describing the functional property of “room”.

Combinatorial information tags are compound information tags that are composed of multiple semantic and/or syntactic information tags. For example, the combination of ‘past participle verb’ (POS tag ‘VBN’) and ‘preposition’ (POS tag ‘IN’) is a combinatorial information tag (combining two syntactic information tags) that describes a directional passive verbal relation represented by bigrams like “provided by” and “located in”. The combination of ‘adjective’ (syntactic information tag, POS tag ‘JJ’) and ‘subject’ (semantic information tag ‘s’) is another example of a combinatorial information tag (combining syntactic and semantic information tags) that describes a ‘subject’ with a certain property.
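The three kinds of information tags can be pictured with a small data structure. This is a sketch under stated assumptions: the class `TaggedInstance` and the function `combine` are hypothetical names, and treating “room” as an instance of ‘subject’ is assumed purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TaggedInstance:
    """An extracted information instance with its tag(s); a hypothetical structure."""
    text: str
    tags: Tuple[str, ...]  # one tag: semantic or syntactic; several tags: combinatorial


def combine(left: TaggedInstance, right: TaggedInstance) -> TaggedInstance:
    """Aggregate two adjacent instances into one that carries a combinatorial tag."""
    return TaggedInstance(text=f"{left.text} {right.text}", tags=left.tags + right.tags)


# Two syntactic tags (POS tags 'VBN' and 'IN') combined into a verbal relation.
located_in = combine(TaggedInstance("located", ("VBN",)), TaggedInstance("in", ("IN",)))

# A syntactic tag ('JJ') combined with a semantic tag ('subject').
habitable_room = combine(
    TaggedInstance("habitable", ("JJ",)),
    TaggedInstance("room", ("subject",)),  # assumed tag, for illustration only
)

print(located_in)
print(habitable_room)
```

Aggregating smaller tagged units into larger ones in this way is in the spirit of the bottom-up method, although the actual aggregation rules are defined by the SeM and CoR rules rather than by simple concatenation.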

The Target: Logic Clauses

The target of the ITr process is the set of output logic clauses which are used to represent the requirements in construction regulations. A HC format is used for such representation, in order to facilitate further automated reasoning using logic programs. One single HC represents one requirement. The RHS of the HC (in Prolog syntax the logical RHS appears to the left of “:-”) indicates the compliance result(s). The LHS of the HC encodes the conditions for the requirement using one or more predicates. Each predicate defines either a concept information instance (e.g., court(C)) or a relation information instance (e.g., has(C,W)). The logic clause elements in a concept predicate are called concept logic clause elements. The logic clause elements in a relation predicate are called relation logic clause elements. Table 1 shows the source and target for a sample sentence.

Insert Table 1
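The concept and relation predicates described above can be assembled into a Prolog-style HC string as in the following sketch. The helper names (`concept`, `relation`, `horn_clause`) are hypothetical, and the clause built here reuses the exterior basement wall example rather than the content of Table 1; the decimal threshold is an assumption.

```python
from typing import List


def concept(name: str, var: str) -> str:
    """Concept predicate, e.g. court(C)."""
    return f"{name}({var})"


def relation(name: str, *args: str) -> str:
    """Relation predicate, e.g. has(C,W)."""
    return f"{name}({','.join(args)})"


def horn_clause(head: str, body: List[str]) -> str:
    """Prolog-style clause: the consequent is written to the left of ':-'."""
    return f"{head} :- {', '.join(body)}."


body = [
    concept("thickness", "T"),
    concept("exterior_basement_wall", "W"),
    relation("has", "W", "T"),
    # The threshold is written as 7.5 inches; this decimal form is an assumption.
    relation("greater_than_or_equal", "T", "quantity(7.5,inches)"),
]
clause = horn_clause(concept("compliant", "T"), body)
print(clause)
```

Keeping predicates as plain strings is only a convenience for display; a real implementation would more likely keep them as structured terms until the clause is serialized.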


Semantic Mapping (SeM) Rules

The semantic mapping (SeM) rules define how to process the extracted information instances according to their semantic meaning. The semantic meaning of each information instance is defined by: (1) the information tag it is associated with (for example, in Table 1, ‘subject’ defines the semantic meaning of “court”, i.e., it defines that “court” is the ‘subject’ of compliance checking); and (2) the context of the extracted information instance, reflected by the information tags of its surrounding information instances. For example, in the following sentence, the semantic meaning of “not less than” (instance of ‘comparative relation’) is defined by the information tag of its surrounding information instance “for each”: “The minimum net area of ventilation openings shall not be less than 1 square foot for each 150 square feet of crawl space area”. “For each”, here, indicates that “not less than” (relation) is not simply a relationship between “net area” (instance of ‘compliance checking attribute’) and “1 square foot” (instance of ‘quantity value’ + ‘quantity unit’), but it is also restricted by “150 square feet of crawl space area” (instance of a ‘quantity value’ + ‘quantity reference’). The interpretation of this requirement is that the quantity requirement on “minimum net area of ventilation openings” will increase by 1 square foot for each additional “150 square feet of crawl space area”.

The semantic meanings of information instances are utilized in patterns on the LHS of SeM rules. For the example in Table 1, the corresponding SeM rule pattern is ‘subject’ + ‘modal verb’ + ‘negation’ + ‘be’ + ‘comparative relation’ + ‘quantity value’ + ‘quantity unit’ + ‘preposition’ + ‘compliance checking attribute’. An SeM rule with this LHS pattern will transform the information instances into the logic clause shown in the last row of Table 1. A sample action defined on the RHS of this SeM rule is: “Generate predicates for the ‘subject’ information instance, the ‘attribute’ information instance, and a ‘has’ information instance. The two arguments of the ‘has’ information instance are from the ‘subject’ predicate and the ‘attribute’ predicate, respectively”. Accordingly, the following logic clause elements are generated for the following sentence, since “court” is recognized as a ‘subject’ information instance and “width” as an ‘attribute’ information instance.

• Sentence: “Courts shall not be less than 3 feet in width”
• Logic Clause Elements: court(Court), width(Width), has(Court,Width)
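The sample action quoted above can be illustrated with a minimal Python sketch. The function names, the (term, tag) pair layout, and the tag abbreviations 's' and 'a' are assumptions made for illustration; this is not the authors' implementation.

```python
def to_variable(term):
    """Turn a term such as 'court' into a Prolog-style variable name ('Court')."""
    name = term.replace(" ", "_")
    return name[0].upper() + name[1:]


def subject_attribute_action(instances):
    """Hypothetical RHS action: emit two concept predicates and a linking 'has' predicate.

    `instances` is a list of (term, tag) pairs; 's' marks a 'subject' and
    'a' marks a 'compliance checking attribute' (abbreviations assumed).
    """
    subject = next(term for term, tag in instances if tag == "s")
    attribute = next(term for term, tag in instances if tag == "a")
    s_var, a_var = to_variable(subject), to_variable(attribute)
    return [
        f"{subject.replace(' ', '_')}({s_var})",
        f"{attribute.replace(' ', '_')}({a_var})",
        f"has({s_var},{a_var})",
    ]


instances = [("court", "s"), ("width", "a")]
print(", ".join(subject_attribute_action(instances)))
# prints: court(Court), width(Width), has(Court,Width)
```

The two arguments of the 'has' predicate reuse the variables of the 'subject' and 'attribute' predicates, which is what links the three logic clause elements together.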

The ITr method is intended to process each term of a sentence in a sequential manner. In general, sequential processing for information transformation requires information instances that are matched by patterns (in SeM rules) to be strictly located next to each other. Such a rigid processing requirement could cause difficulty in processing sentences with different structures. To avoid that, the proposed SeM rules do not follow such a rigid requirement. Instead, the SeM rules allow for “look-back searching” (i.e., searching to the left of the matched words) and “look-ahead searching” (i.e., searching to the right of the matched words) to find instances that match certain information tags in a pattern. For example, in the following pattern, the instance of the first ‘subject’ does not have to be located right next to the instance of ‘preposition’: “‘subject’ + ‘preposition’ + ‘subject’”. It is only required to be the ‘subject’ instance that is closest to the ‘preposition’ instance from the left. “Look-back searching”, here, searches to the left of the matched word for ‘preposition’ to find the closest ‘subject’ instance when the later part of the pattern “‘preposition’ + ‘subject’” is matched. This allows for more flexibility in the use of SeM rules to handle sentence complexities (e.g., those incurred by cases such as tail recursive nested clauses). For example, an SeM rule uses the following pattern P1 to match the last three information instances in InS1 (‘s’ for ‘subject’, ‘VBP’ for ‘non-3rd person singular present verb’, ‘dpvr’ for ‘directional passive verbal relation’, and ‘VB’ for ‘base form verb’), finds the first information instance in InS1 through “look-back searching”, and generates the logic clause elements LC1 for the part of sentence S1:

• Pattern P1: ‘non-3rd person singular present verb’ ‘directional passive verbal relation’ ‘base form verb’
• Information Instances InS1: (‘connection’, ‘s’) … (‘are’, ‘VBP’), (‘designed_to’, ‘dpvr’), (‘yield’, ‘VB’)
• Sentence S1: “Connections that are designed to yield shall be capable of …”
• Logic Clause Elements LC1: connection(Connection), yield(Yield), designed_to(Connection,Yield)
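One way to picture “look-back searching” for pattern P1 is the following Python sketch. The helper names are hypothetical, and the instance ('that', 'WDT') is only a stand-in for whatever the elided part of InS1 contains; the sketch assumes single-word terms.

```python
def look_back(instances, start, wanted_tag):
    """Return the index of the closest instance left of `start` tagged `wanted_tag`, or None."""
    for i in range(start - 1, -1, -1):
        if instances[i][1] == wanted_tag:
            return i
    return None


def apply_p1(instances):
    """Match VBP + dpvr + VB, then look back for the closest 'subject' ('s')."""
    tags = [tag for _, tag in instances]
    for i in range(len(tags) - 2):
        if tags[i:i + 3] == ["VBP", "dpvr", "VB"]:
            s = look_back(instances, i, "s")
            if s is None:
                continue  # no subject to the left, so the rule does not fire here
            subj, rel, verb = instances[s][0], instances[i + 1][0], instances[i + 2][0]
            s_var, v_var = subj.capitalize(), verb.capitalize()
            return [f"{subj}({s_var})", f"{verb}({v_var})", f"{rel}({s_var},{v_var})"]
    return []


# ('that', 'WDT') is an assumed placeholder for the elided instance(s) in InS1.
ins1 = [("connection", "s"), ("that", "WDT"), ("are", "VBP"),
        ("designed_to", "dpvr"), ("yield", "VB")]
print(", ".join(apply_p1(ins1)))
# prints: connection(Connection), yield(Yield), designed_to(Connection,Yield)
```

Because the subject is located by searching leftward rather than by adjacency, any number of intervening instances can sit between it and the matched segment.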


In the proposed methodology, application-specific SeM rules are developed based on a randomly selected sample of text (called “development text”, which is also used for text analysis and further development of CoR rules). For developing a set of SeM rules for ITr, a three-step, iterative methodology, to be applied to each sentence, is proposed: (1) find all relations in a sentence (e.g., “of” and “not exceed” in the sentence “Spacing of transverse reinforcement shall not exceed 8 inches.”); (2) for each relation, run the existing SeM rule set to check if the rule set can generate the corresponding logic clause elements correctly and define the subsequent action based on the following three cases: (a) if the corresponding logic clause elements are correctly generated, then move to check the next relation, (b) if the corresponding logic clause elements are incorrectly generated, then create a new SeM rule with a more specific pattern (i.e., a longer pattern with more features) than the applied SeM rule and add it to the rule set with a higher priority, and (c) if the corresponding logic clause elements are not generated, then create a new SeM rule and add it to the rule set; and (3) after all relations have been checked, run the updated SeM rule set on all checked sentences and check if errors have been introduced due to the added SeM rules. If errors have been introduced, then identify the source(s) of errors (i.e., the rule(s) that introduced the errors) and adjust those rules as necessary.

Conflict Resolution (CoR) Rules

The conflict resolution (CoR) rules resolve conflicts between information tags. Two types of CoR rules are used: deletion CoR rules and conversion CoR rules. Deletion CoR rules resolve conflicts between information tags by deleting certain information instances. For example, the following deletion CoR rule CoR1 is used to delete redundant information instances InS2 (‘cr’ for ‘candidate restriction’) from the set of extracted information instances InS3 (‘s’ for ‘subject’) for the sentence S2:


• Deletion CoR Rule CoR1: “if an information instance has the tag ‘subject’ and it subsumes its following information instance(s), then delete its following information instance(s).”
• Information Instances InS2: (‘exterior’, ‘cr’), (‘basement’, ‘cr’), (‘wall’, ‘cr’)
• Information Instances InS3: (‘exterior basement wall’, ‘s’), (‘exterior’, ‘cr’), (‘basement’, ‘cr’), (‘wall’, ‘cr’)
• Sentence S2: “The thickness of exterior basement walls and foundation walls shall be not less than 7 1/2 inches.”
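A minimal Python sketch of a deletion rule in the spirit of CoR1 is given below. It assumes that “subsumes” can be approximated by checking whether a following term is one of the words of the ‘subject’ term; the function name and this containment test are illustrative assumptions rather than the authors' code.

```python
def apply_deletion_cor1(instances):
    """Drop instances that directly follow a 'subject' ('s') and are subsumed by it.

    Subsumption is approximated here (an assumption) as: the following term
    is one of the whitespace-separated words of the subject term.
    """
    result = []
    i = 0
    while i < len(instances):
        term, tag = instances[i]
        result.append((term, tag))
        i += 1
        if tag == "s":
            words = term.split()
            # Skip every directly following instance contained in the subject.
            while i < len(instances) and instances[i][0] in words:
                i += 1
    return result


ins3 = [("exterior basement wall", "s"), ("exterior", "cr"),
        ("basement", "cr"), ("wall", "cr")]
print(apply_deletion_cor1(ins3))
# prints: [('exterior basement wall', 's')]
```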

Conversion CoR rules resolve conflicts between information tags by converting information tags of information instances into other types of information tags. For example, the following conversion CoR rule CoR2 is used to convert information tags in information instances InS4 (‘s’ for ‘subject’, ‘I’ for ‘inter clause boundary relation’, and ‘a’ for ‘compliance checking attribute’) to information tags in information instances InS5 (‘IN’ for ‘preposition’) for the sentence S3:

• Conversion CoR Rule CoR2: “if ‘with’ is directly followed by an information instance that has the information tag ‘compliance checking attribute’ and ‘with’ has the information tag ‘inter clause boundary relation’, then convert the information tag of ‘with’ to ‘preposition’.”
• Information Instances InS4: (‘wall segment’, ‘s’), (‘with’, ‘I’), (‘horizontal_length_to_thickness_ratio’, ‘a’)
• Information Instances InS5: (‘wall segment’, ‘s’), (‘with’, ‘IN’), (‘horizontal_length_to_thickness_ratio’, ‘a’)
• Sentence S3: “Wall segments with a horizontal length-to-thickness ratio less than 2.5 shall be designed as columns.”
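The conversion from InS4 to InS5 can be sketched in a few lines of Python. The function name and the list-of-pairs representation are assumptions for illustration; the condition itself follows the wording of CoR2.

```python
def apply_conversion_cor2(instances):
    """Retag 'with' from 'I' (inter clause boundary relation) to 'IN' (preposition)
    when it is directly followed by a 'compliance checking attribute' ('a')."""
    converted = list(instances)  # leave the input list untouched
    for i in range(len(converted) - 1):
        term, tag = converted[i]
        if term == "with" and tag == "I" and converted[i + 1][1] == "a":
            converted[i] = (term, "IN")
    return converted


ins4 = [("wall segment", "s"), ("with", "I"),
        ("horizontal_length_to_thickness_ratio", "a")]
print(apply_conversion_cor2(ins4))
```

Only the tag changes; the instance itself is kept, which is what distinguishes a conversion CoR rule from a deletion CoR rule.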


In the proposed rule-based ITr, the CoR rules are executed before the SeM rules, after the information instances have been extracted by the IE process. The development of CoR rules is needed when conflicts between SeM rules cannot be resolved by adjusting SeM rule patterns and actions. For developing a set of CoR rules for ITr, a five-step methodology is proposed: (1) find information tags that are the sources of errors through pattern analysis of conflicting SeM rules, (2) for each conflict, create a new candidate CoR rule to resolve the conflict, (3) try the candidate rule and empirically analyze whether the conflict was resolved without introducing new conflicts, (4) if the trial was successful, then add the candidate CoR rule as a new rule to the existing CoR rule set, and if the trial was unsuccessful, then iterate Steps 2 and 3 until a successful trial is found, and (5) after each new CoR rule is added, check all SeM rules and update them as necessary according to the changes in information tags caused by the new CoR rule.

Bottom-up Method for Handling Complex Sentence Components

Due to the variability of natural language expressions and structures, sentences used in regulatory provisions could be very complex. For example, phrases and clauses could be continuously attached/nested to a sentence to constantly enrich it with more relevant information. Complex sentences are difficult to process for information extraction and transformation. Complex sentence components are intermediately-processed segments of text that are: (1) expressed using a variety of natural language structure patterns, and (2) composed of multiple concepts and relations. Complex sentence components are more likely to result in complex sentence structures by embedding in or attaching more concepts and relations to a sentence. Figure 3 shows a complex sentence from IBC 2006. Two methods were explored in handling complex sentence components: a top-down method and a bottom-up method (Figure 4). The top-down method starts from the top level (i.e., full sentence) and proceeds down to identify and process complex sentence components. The bottom-up method starts from the lowest level (i.e., single terms/concepts/relations in a sentence) and proceeds up to identify and process complex sentence components. The bottom-up method is employed in the proposed ITr approach because, based on the authors’ previous work, it has been shown to achieve better performance than the top-down method (Zhang and El-Gohary 2013b).

Insert Figure 3

Insert Figure 4

In the bottom-up method, the SeM rules are used to process sentences starting from the lowest level, i.e., starting from information instances (which correspond to single terms/concepts/relations in a sentence). The information instances in the source text are put into lists (one list for each sentence) and are processed one by one until all information instances have been processed. The order of the instances in the list is determined based on their order in the original sentence.

To apply the bottom-up method, the authors propose a new “consume and generate” mechanism to execute the SeM rules in a sequential manner. This mechanism follows the heuristics of the “sliding window” method in computational research (i.e., a sequence of data is sequentially processed, segment by segment, and each segment has a predefined fixed length, the “window size”) and the mechanism of transcription in the genetics domain (i.e., a sequence of DNA is sequentially transcribed, segment by segment, and each segment has a length of about 17 base pairs). The “consume and generate” mechanism processes all text segments that match an SeM rule pattern, where each segment matches a pattern of one SeM rule and each pattern consists of information tags for a sequence of information instances. However, in comparison to the “sliding window” method, the segment length in the proposed “consume and generate” mechanism is not fixed across patterns to allow for flexibility in capturing complex sentence structures. The length of each segment is determined according to the number of information tags in the corresponding SeM rule pattern. For example, the following pattern P2 has a segment length of three and matches the information instances InS6 for the part of sentence S4 to generate logic clause elements LC2:

• Pattern P2: ‘compliance checking attribute’ ‘of’ ‘subject’
• Information Instances InS6: (‘area’, ‘a’), (‘of’, ‘OF’), (‘space’, ‘s’)
• Sentence S4: “The net free ventilating area shall not be less than 1/150 of the area of the space ventilated …”
• Logic Clause Elements LC2: space(Space), area(Area), has(Space, Area)
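The variable segment length can be made concrete with a small Python sketch of a “consume and generate” loop. The rule table, the action function, and the fall-through behavior (advance by one instance when no pattern matches) are assumptions for illustration; the authors' pseudocode is given in their figures, not here. Single-word terms are assumed.

```python
def has_action(segment):
    """Hypothetical action for pattern P2: attribute 'of' subject -> has(Subject, Attribute)."""
    (attr, _), _, (subj, _) = segment
    s_var, a_var = subj.capitalize(), attr.capitalize()
    return [f"{subj}({s_var})", f"{attr}({a_var})", f"has({s_var},{a_var})"]


# Each rule pairs a tag pattern with an action; the pattern length sets the segment length.
RULES = [(("a", "OF", "s"), has_action)]


def consume_and_generate(instances, rules=RULES):
    """Scan left to right; on a match, consume the segment and generate its elements."""
    elements, i = [], 0
    while i < len(instances):
        for pattern, action in rules:
            segment = instances[i:i + len(pattern)]
            if tuple(tag for _, tag in segment) == pattern:
                elements.extend(action(segment))
                i += len(pattern)  # consume exactly as many instances as the pattern has tags
                break
        else:
            i += 1  # no rule matched at this position
    return elements


ins6 = [("area", "a"), ("of", "OF"), ("space", "s")]
print(", ".join(consume_and_generate(ins6)))
# prints: space(Space), area(Area), has(Space,Area)
```

Listing more specific (longer) patterns first in the rule table would give them priority, in line with the rule development procedure described earlier.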

The “consume and generate” mechanism allows for backward matching: if information instances extracted from a segment of text match the later part of a pattern, then the information instance(s) extracted from preceding text are checked for matching of the earlier part of the same pattern, and corresponding logic clauses are generated if the check succeeds. For example, the following information tags InT1 are associated with the five information instances from the part of sentence S5. After the first three information instances InS7 are processed based on matching with the pattern P3, two information instances “or” and “space” remain. These two remaining information instances only match the later part (i.e., second and third information tags) of the pattern P4 for ‘conjunctive subject’. Normally, this partial matching would not initiate the processing of the information instances. However, under the proposed backward matching mechanism, the preceding information instance “interior room” is checked for the matching of the earlier part of the pattern for ‘conjunctive subject’ (i.e., the first information tag: ‘subject’). Since “interior room” matches ‘subject’, the SeM rule for ‘conjunctive subject’ gets applied and the two remaining information instances are processed to generate the logic clause elements LC3 (where “;” is the disjunctive operator, i.e., “A ; B” means “A or B”).

• Information Tags InT1: ‘compliance checking attribute’, ‘of’, ‘subject’, ‘conjunctive term’, ‘subject’
• Sentence S5: “…the floor area of the interior room or space…”
• Information Instances InS7: “floor area”, “of”, “interior room”
• Pattern P3: ‘compliance checking attribute’ + ‘of’ + ‘subject’
• Pattern P4: ‘subject’ + ‘conjunctive term’ + ‘subject’
• Logic Clause Elements LC3: interior_room(Interior_room); space(Interior_room)
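Backward matching for pattern P4 might be sketched as follows in Python. The tag abbreviation 'CC' for ‘conjunctive term’ and all function names are assumptions; the sketch only shows how already-consumed instances can supply the earlier part of a pattern.

```python
P4 = ("s", "CC", "s")  # 'subject' + 'conjunctive term' + 'subject'; 'CC' is an assumed abbreviation


def pred(term):
    """Predicate name for a term, e.g. 'interior room' -> 'interior_room'."""
    return term.replace(" ", "_")


def var(term):
    """Variable name for a term, e.g. 'interior room' -> 'Interior_room'."""
    return pred(term).capitalize()


def conjunctive_subject(consumed, remaining, pattern=P4):
    """Complete the pattern with already-consumed instances, then emit a disjunction.

    Returns None when the preceding instances do not supply the earlier
    part of the pattern.
    """
    head_len = len(pattern) - len(remaining)
    if head_len <= 0 or head_len > len(consumed):
        return None
    window = consumed[-head_len:] + remaining
    if tuple(tag for _, tag in window) != pattern:
        return None
    first, second = window[0][0], window[-1][0]
    # Both predicates share the first subject's variable, as in LC3.
    return f"{pred(first)}({var(first)}); {pred(second)}({var(first)})"


consumed = [("floor area", "a"), ("of", "OF"), ("interior room", "s")]  # handled by P3
remaining = [("or", "CC"), ("space", "s")]
print(conjunctive_subject(consumed, remaining))
# prints: interior_room(Interior_room); space(Interior_room)
```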


Validation

Results are evaluated in terms of precision, recall, and F1 measure. Precision is the number of correctly generated logic clause elements divided by the total number of generated logic clause elements. Recall is the number of correctly generated logic clause elements divided by the total number of logic clause elements that should be generated. F1 measure is the harmonic mean of precision and recall, assigning equal weights to precision and recall. Ideally, both 100% recall and precision are desired. However, given the inherent trade-off between the two measures, it is difficult to achieve such a result. The ultimate goal for ACC is, therefore, to achieve 100% recall of non-compliance instances with high precision.
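The three measures defined above translate directly into code. In the Python sketch below, the function name and the counts used in the example call are made up purely for illustration and are not results from this study.

```python
def precision_recall_f1(correct, generated, expected):
    """Compute precision, recall, and F1 for logic clause elements.

    correct:   number of correctly generated logic clause elements
    generated: total number of generated logic clause elements
    expected:  total number of logic clause elements that should be generated
    """
    precision = correct / generated
    recall = correct / expected
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, equal weights
    return precision, recall, f1


# Illustrative counts only (hypothetical, not taken from the experiments).
p, r, f1 = precision_recall_f1(correct=90, generated=100, expected=120)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints: precision=0.900 recall=0.750 f1=0.818
```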


Experimental Implementation and Validation

For testing and validation, the proposed ITr methodology was empirically implemented in transforming information instances of quantitative requirements, which were automatically extracted from the IBC 2009, into logic clauses.

Source Text Selection

The proposed ACC approach and ITr methodology are intended to process information from a variety of construction-related textual regulatory documents (e.g., building codes, environmental regulations, safety regulations and standards). Since building codes are the primary sets of regulations governing the design, construction, operation, and maintenance of residential and commercial buildings, they were chosen for testing the proposed ITr methodology. In the U.S., almost all state authorities (except for Delaware, Massachusetts, Mississippi, and Missouri) adopt versions of the IBC by ICC. Thus, IBC was selected as the source text corpus. More specifically, IBC 2006 and IBC 2009 were selected because of their availability and ease of comparison with the authors’ previous NLP work, in which IBC 2006 and IBC 2009 were used for testing and validation (Zhang and El-Gohary 2013c).

The SeM and CoR rules were developed based on Chapters 12 and 23 of IBC 2006, and the proposed ITr algorithms were tested in processing information instances of “quantitative requirements” that were extracted from Chapter 19 of IBC 2009. A quantitative requirement is a requirement which defines the relationship between an attribute of a certain building element/part and a specific quantity value (or quantity range). For example, the following sentence states that the width (attribute) of a court (building element/part) should be greater than or equal to 3 feet (quantity value): “Courts shall not be less than 3 feet in width”. The authors decided to conduct the experiment on the extraction of quantitative requirements because: (1) IBC 2006 and IBC 2009 describe many quantitative requirements (e.g., on average, quantitative requirements represent 41% of the requirements in Chapters 12 and 23 of IBC 2006 and Chapter 19 of IBC 2009), which ensures a sufficient number of relevant sentences for development and testing; and (2) sentences describing quantitative requirements appear to be more complex than those describing other types of requirements (e.g., existential requirements, which require the existence of a certain building element/part), which implies that they are more difficult to process. This makes quantitative requirements good candidates for testing.

Tool Selection

The proposed TC, IE, and ITr algorithms were combined into one computational platform. The representation of Prolog was selected for logic clause representation, in order to facilitate future CR. Prolog is an approximate realization of the logic programming computational model on a sequential machine (Sterling and Shapiro 1986). It is the most popular logic programming language with a reasoner. The syntax of B-Prolog was used. B-Prolog is a Prolog system with extensions for programming concurrency, constraints, and interactive graphics. It has a bi-directional interface with C and Java (Zhou 2012). To facilitate quantitative reasoning, a set of built-in rules were developed to perform arithmetic and comparative operations on the proposed quantitative representation. The TC and IE algorithms were implemented using the General Architecture for Text Engineering (GATE) tools (Univ. of Sheffield 2013). GATE has built-in tools for a variety of text processing functions (e.g., tokenization, sentence splitting, POS tagging, gazetteer compiling, and morphological analysis). For ITr, the SeM rules and CoR rules were implemented using the Python programming language (v3.3.2). The “re” module (i.e., regular expression module) in Python was used for pattern matching, so that each extracted information instance could be used for subsequent processing steps based on its information tag (example tags are shown in Figure 3). A domain ontology was developed and used to facilitate semantic IE and ITr. In developing the ontology, the ontology development methodology in El-Gohary and El-Diraby (2010) was followed. GATE’s built-in ontology editor was used for ontology building and editing.

Information Representation

Two types of logic statements in B-Prolog syntax were utilized: facts and rules. A rule has the form: “H :- B1, B2, …, Bn. (n>0)”. H, B1, …, Bn are atomic formulas. H is called the head, and the RHS of ‘:-’ is called the body of the rule. A fact is a special kind of rule whose body is always true (Zhou 2012). Each requirement rule in IBC 2006 and IBC 2009 is represented as one single B-Prolog rule. Instances of concepts are represented using unary predicates. For example, the information instance “floor” is represented by the predicate “floor(F)”, with “floor” being the predicate name and the variable “F” (variables in B-Prolog start with a capital letter or an underscore) being the argument for the predicate. Instances of relations are represented using binary or n-ary predicates. For example, “provided with” is a relation which is represented as the predicate “provided_with(A,B)”, while the variables “A” and “B” could be defined in the predicates interior_space(A) and space_heating_system(B). Each design fact, on the other hand, is represented using one B-Prolog fact. The B-Prolog reasoner can then automatically reason about the facts and rules and, accordingly, determine the compliance checking result(s). An example is shown in Figure 5.
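As a rough illustration of the rule and fact forms described above, the following Python sketch renders predicates into Prolog-style text. The helper names, the head predicate `compliant_interior_space`, and the constant `room1` are hypothetical and are not taken from the paper or from Figure 5.

```python
def predicate(name, *args):
    """Render a unary, binary, or n-ary predicate such as provided_with(A,B)."""
    return f"{name}({','.join(args)})"


def rule(head, body):
    """Render 'H :- B1, ..., Bn.'; an empty body yields a fact 'H.'."""
    if not body:
        return f"{head}."
    return f"{head} :- {', '.join(body)}."


body = [
    predicate("interior_space", "A"),
    predicate("space_heating_system", "B"),
    predicate("provided_with", "A", "B"),
]
# The head name below is a hypothetical placeholder for a compliance result.
print(rule(predicate("compliant_interior_space", "A"), body))
# A design fact with a hypothetical constant:
print(rule(predicate("interior_space", "room1"), []))
```

The first line printed is a rule whose head sits to the left of “:-” and whose body lists the conditions; the second is a fact, i.e., a rule with an empty body.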

Insert Figure 5


Information Tags

A total of 40 information tags were developed for use in the SeM rules and CoR rules for ITr: 17 semantic information tags, 22 syntactic information tags, and 1 combinatorial information tag.

Two main types of semantic information tags were defined (as per Figure 6): essential information tags and secondary information tags. Essential information tags are tags for information that must be defined for this specific type of requirement. Six main types of essential information tags were defined for quantitative requirements: subject, compliance checking attribute, comparative relation, quantity value, quantity unit, and quantity reference. A ‘subject’ is an ontology concept; it is a “thing” (e.g., building object, space) that is subject to a particular regulation or norm. A ‘compliance checking attribute’ is an ontology concept; it is a specific characteristic of a ‘subject’ by which its compliance is assessed. A ‘comparative relation’ is an ontology relation which is commonly used for comparing quantitative values (i.e., comparing an existing value to a required minimum or maximum value). Five subtypes of comparative relations were further defined: ‘greater than or equal to’, ‘greater than’, ‘less than or equal to’, ‘less than’, and ‘equal to’. A ‘quantity value’ is a value, or a range of values, which defines the quantified requirement. A ‘quantity unit’ is the unit of measure for the ‘quantity value’. A ‘quantity reference’ is a reference to another quantity (which includes a value and a unit).

Secondary information tags are tags for information that is not necessary for this specific type of requirement, but may exist in defining the requirement. Two main types of secondary information tags were defined for quantitative requirements: ‘restriction’ and ‘exception’. A ‘restriction’ is a concept that places a constraint on the ‘subject’, ‘compliance checking attribute’, ‘comparative relation’, pair of ‘quantity value’ and ‘quantity unit’, pair of ‘quantity value’ and ‘quantity reference’, or the full requirement. A ‘subject restriction’ is a concept that places a constraint on the ‘subject’. Two subtypes of ‘subject restriction’ were further defined: ‘possessive subject restriction’ and ‘nonpossessive subject restriction’. A ‘possessive subject restriction’ places a possessive constraint on the ‘subject’, thereby restricting the ‘subject’ to possess certain building parts or properties. For example, in the following requirement sentence, “having windows opening on opposite sides” is a ‘possessive subject restriction’ on “court”: “Courts having windows opening on opposite sides shall not be less than 6 feet in width”. A ‘nonpossessive subject restriction’ places a nonpossessive constraint on the ‘subject’, thereby restricting the ‘subject’ not to possess certain building parts or properties. A ‘compliance checking attribute restriction’ places a constraint on the ‘compliance checking attribute’, thereby restricting the ‘compliance checking attribute’ to a more specific type. For example, in the following requirement sentence, “to the outdoors” is a ‘compliance checking attribute restriction’ on “minimum openable area”: “The minimum openable area to the outdoors shall be 4 percent of the floor area being ventilated”. A ‘comparative relation restriction’ places a constraint on the ‘comparative relation’, thereby restricting the ‘comparative relation’ using new conditions. For example, in the following requirement sentence, “for each 150 square feet of crawl space area” is a ‘comparative relation restriction’ on “not less than”: “The minimum net area of ventilation openings shall not be less than 1 square foot for each 150 square feet of crawl space area”. A ‘quantity restriction’ places a constraint on the ‘quantity value’ + ‘quantity unit’/‘quantity reference’ pair, thereby specifying the properties (e.g., range) of the pair. A ‘full requirement restriction’ places a constraint on the whole quantitative requirement, thereby restricting the quantitative requirement with new preconditions. An ‘exception’ defines a condition where the described requirement does not apply.


For syntactic information tags, the Hepple POS Tagger was used to generate POS tag features. Some additional syntactic features that were not in the Hepple POS Tagger (e.g., the preposition “of”) were also defined. Each selected POS type and defined syntactic feature represents a syntactic information tag such as adjective (POS tag ‘JJ’) and preposition “of” (the literal “OF”). One combinatorial information tag was defined for use in this implementation and was called ‘directional passive verbal relation’, which is the combination of ‘past participle verb’ (POS tag ‘VBN’) and ‘preposition’ (POS tag ‘IN’). Combinatorial information tags are expressive and flexible. Thus, more combinatorial information tags may be defined and used if more complex information tags are needed to capture complex meanings or patterns.
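The combination of ‘VBN’ and ‘IN’ into one combinatorial tag can be sketched in Python as below. The function name, the underscore-joined term, and the choice to merge only directly adjacent tokens are assumptions; ‘dpvr’ is the abbreviation used earlier for ‘directional passive verbal relation’.

```python
def merge_dpvr(tagged):
    """Merge each adjacent (VBN, IN) pair into one 'dpvr' instance.

    `tagged` is a list of (token, POS tag) pairs; the merged term joins
    the two tokens with an underscore (an illustrative convention).
    """
    merged, i = [], 0
    while i < len(tagged):
        if (i + 1 < len(tagged)
                and tagged[i][1] == "VBN" and tagged[i + 1][1] == "IN"):
            merged.append((f"{tagged[i][0]}_{tagged[i + 1][0]}", "dpvr"))
            i += 2  # both tokens are absorbed into the combined instance
        else:
            merged.append(tagged[i])
            i += 1
    return merged


print(merge_dpvr([("provided", "VBN"), ("with", "IN")]))
# prints: [('provided_with', 'dpvr')]
```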

Insert Figure 6

Gold Standard

The gold standard for Chapter 19 of IBC 2009 was developed semi-automatically. In the authors’ previous work, all sentences that include a number (both digits and word forms of a number) were automatically extracted to ensure 100% recall of sentences describing quantitative requirements. Then, one of the authors manually deleted false positive sentences. After that, one of the authors manually coded the logic clauses based on the extracted information instances from each sentence. The gold standard was reviewed by two other researchers to verify its correctness. Because of the unambiguous nature of quantitative requirements, along with the well-defined information representation that is used in the proposed methodology, there was agreement in formulating the gold standard. For Chapter 19, 62 sentences containing quantitative requirements were recognized. Correspondingly, 62 logic clauses were coded. In these 62 logic clauses, 1901 logic clause elements were identified, including 568 logic clause elements for describing concepts and 1333 logic clause elements for describing relations between concepts.
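The first, automated step of the gold standard development (keeping every sentence that mentions a number as digits or as a number word) might be sketched in Python as follows. The word list, the regular expression, the function name `mentions_number`, and the second and third example sentences are illustrative assumptions rather than the authors' extraction code.

```python
import re

# Assumed, deliberately short list of number words; a real list would be longer.
NUMBER_WORDS = [
    "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "twelve", "twenty", "hundred", "thousand",
]

# Match any digit, or any listed number word as a whole word.
NUMBER_PATTERN = re.compile(
    r"\d|\b(?:" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE
)

def mentions_number(sentence: str) -> bool:
    """Return True if the sentence contains a digit or a listed number word."""
    return NUMBER_PATTERN.search(sentence) is not None

sentences = [
    "The minimum net area of ventilation openings shall not be less than "
    "1 square foot for each 150 square feet of crawl space area.",
    "Walls shall be supported by at least two columns.",     # invented example
    "Materials shall comply with the referenced standard.",  # invented example
]
candidates = [s for s in sentences if mentions_number(s)]
```

A filter this permissive favors recall over precision, which is consistent with the manual removal of false positives described above.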

Algorithm Implementation

The proposed ITr methodology was implemented using the Python programming language. The processing steps of an example sentence and the pseudocode for the main algorithm and the “consume and generate” mechanism are shown in Figure 7, Figure 8, and Figure 9, respectively.

Insert Figure 7

Insert Figure 8

Insert Figure 9

As shown in Figure 7, the IE process tags the original sentence with information tags (from Part I to Part II). The main ITr algorithm then represents each information instance in the tagged sentence as a four-tuple (from Part II to Part III). The CoR rules in the main algorithm then process the information instance tuple list to resolve conflicts between tuples (from Part III to Part IV). The “consume and generate” code then executes the set of SeM rules to process each tuple in the list and generate logic clause elements based on matching of SeM rule patterns (from Part IV to Part V). For each information instance, the four-tuple stores: (1) the information instance itself, (2) the location of the information instance in the corresponding sentence (represented by the starting point of the information instance in the sentence), (3) the length of the information instance in terms of number of letters, and (4) the information tag of the information instance (e.g., ‘Interior’, 0, 15, and ‘s’ for the first information instance in Part III of Figure 7).
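The four-tuple representation described above can be sketched in Python as follows. This is a minimal illustration, assuming that the location is a zero-based character offset and that the length is counted in characters; the class name `InfoInstance`, the helper `to_four_tuples`, and the single-letter tag abbreviations other than ‘s’ are hypothetical.

```python
from typing import List, NamedTuple, Tuple

class InfoInstance(NamedTuple):
    text: str    # (1) the information instance itself
    start: int   # (2) starting character offset in the sentence
    length: int  # (3) length of the instance in characters
    tag: str     # (4) information tag

def to_four_tuples(sentence: str,
                   tagged_spans: List[Tuple[str, str]]) -> List[InfoInstance]:
    """Locate each tagged span in the sentence, scanning left to right."""
    instances = []
    cursor = 0
    for text, tag in tagged_spans:
        start = sentence.index(text, cursor)  # raises ValueError if not found
        instances.append(InfoInstance(text, start, len(text), tag))
        cursor = start + len(text)
    return instances

sentence = ("The minimum net area of ventilation openings shall not be "
            "less than 1 square foot")
# Assumed tags: 'a' attribute, 's' subject, 'c' comparative relation,
# 'v' quantity value, 'u' quantity unit.
spans = [("net area", "a"), ("ventilation openings", "s"),
         ("not be less than", "c"), ("1", "v"), ("square foot", "u")]
instances = to_four_tuples(sentence, spans)
```

Storing the offset and length alongside the text makes it possible to detect overlapping instances, which is the kind of conflict that CoR rules are meant to resolve.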

In the main algorithm (Figure 8), the CoR rules are executed through the function “resolve conflicts”. Then, the SeM rules are executed using the “consume and generate” code to process the conflict-free information instances for each sentence of the source text file (in the format of a list of four-tuples) to generate and display the corresponding logic clause. As shown in Figure 9, the “consume and generate” code checks through the patterns for each SeM rule (PATTERN1, PATTERN2, PATTERN3…) and generates logic clauses as a result of matching to SeM rules. In case of no matching, the default negative step length enables backward matching.
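The general idea of consuming matched information instances and generating logic clause elements bottom-up can be illustrated with the simplified Python sketch below. It is not a reproduction of the pseudocode in Figure 9: the step-length handling is omitted, and the three rules, the tag abbreviations, the `COMPARATORS` mapping, and all function names are assumptions introduced only for this example.

```python
from typing import Callable, List, Tuple

Item = Tuple[str, str]                                  # (text, information tag)
Action = Callable[[List[Item]], Tuple[Item, List[str]]]
Rule = Tuple[Tuple[str, ...], Action]                   # (tag pattern, action)

# Assumed mapping from a comparative relation phrase to a predicate name.
COMPARATORS = {"not less than": "greater_than_or_equal"}

def pred(text: str) -> str:
    return text.replace(" ", "_").lower()

def var(text: str) -> str:
    return pred(text).capitalize()

def quantity(matched: List[Item]) -> Tuple[Item, List[str]]:
    (value, _), (unit, _) = matched
    return (f"quantity({value}, {pred(unit)})", "q"), []

def attribute_of_subject(matched: List[Item]) -> Tuple[Item, List[str]]:
    (attr, _), _, (subj, _) = matched
    elements = [f"{pred(subj)}({var(subj)})", f"{pred(attr)}({var(attr)})",
                f"has({var(subj)}, {var(attr)})"]
    return (var(attr), "A"), elements

def comparison(matched: List[Item]) -> Tuple[Item, List[str]]:
    (attr_var, _), (relation, _), (qty, _) = matched
    element = f"{COMPARATORS[relation]}({attr_var}, {qty})"
    return (element, "done"), [element]

RULES: List[Rule] = [(("v", "u"), quantity),
                     (("a", "OF", "s"), attribute_of_subject),
                     (("A", "c", "q"), comparison)]

def consume_and_generate(items: List[Item], rules: List[Rule]) -> List[str]:
    """Repeatedly replace a matched run of items and collect clause elements."""
    items, clause = list(items), []
    progress = True
    while progress:
        progress = False
        for pattern, action in rules:
            n = len(pattern)
            for i in range(len(items) - n + 1):
                if tuple(tag for _, tag in items[i:i + n]) == pattern:
                    new_item, elements = action(items[i:i + n])
                    items[i:i + n] = [new_item]  # consume the matched items
                    clause.extend(elements)      # generate clause elements
                    progress = True
                    break
            if progress:
                break
    return clause

tagged = [("net area", "a"), ("of", "OF"), ("ventilation openings", "s"),
          ("not less than", "c"), ("1", "v"), ("square foot", "u")]
clause = consume_and_generate(tagged, RULES)
```

Each successful match shortens the item list, so smaller components (here, the quantity) are resolved before the larger patterns that depend on them.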

Experimental Results and Discussion

The proposed ITr algorithms were tested in transforming information instances of quantitative requirements, which were automatically extracted from Chapter 19 of IBC 2009, into logic clauses. The following two experiments were conducted to compare the performance of two methods of information representation: (1) using essential semantic information tags only, and (2) using essential, as well as secondary, semantic information tags.

In Experiment #1, only the essential semantic information tags were used: ‘subject’, ‘compliance checking attribute’, ‘comparative relation’, ‘quantity value’, ‘quantity unit’, and ‘quantity reference’. A subset of the gold standard (including logic clause elements corresponding to the essential semantic information instances) was used as the gold standard for Experiment #1. A total of 53 SeM rules and 11 CoR rules were developed.

In Experiment #2, both essential and secondary information tags were used. Figure 3 shows examples of some of the information tags that were used. A total of 297 SeM rules and 9 CoR rules were encoded. The gold standard of Experiment #2 (the full gold standard set) contains 177% more logic clause elements than the gold standard of Experiment #1. This shows that, for quantitative requirements, the source text contains many secondary information instances.


The SeM rules that were developed in the experiments are classified into four main types: simple SeM rules, multiple action SeM rules, multiple condition SeM rules, and complex SeM rules. A simple SeM rule is the simplest type, where a strict SeM pattern directly maps to a logic clause. For multiple action SeM rules, other actions (called “supportive actions”) such as “look-ahead searching” and “look-back searching” are involved in addition to mapping SeM patterns to logic clauses. For multiple condition SeM rules, the mapping from SeM patterns to logic clauses is encoded in subrules to handle subtly different cases in rule conditions, such as the existence/non-existence status of certain information instances. A complex SeM rule is a combination of the first three types of rules; it utilizes both supportive actions and subrules to support mappings from SeM patterns to logic clauses.

The logic clauses generated from the SeM rules are classified into three main types: single predicate logic clauses, multiple predicate logic clauses, and compound predicate logic clauses. A single predicate logic clause includes only one predicate (e.g., “space(Space)”). A multiple predicate logic clause includes more than one predicate (e.g., “space(Space), area(Area), has(Space, Area)”). A compound predicate logic clause has predicate(s) that embed other predicate(s) as argument(s) (e.g., “greater_than_or_equal(T, quantity(71/2, inches))”).
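The three clause types can be made concrete with a small data structure. The Python sketch below is illustrative only: the `Predicate` class and the `clause_type` function are hypothetical names, the quantity value in the compound example is invented, and treating any clause with an embedded predicate as “compound” regardless of its length is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass(frozen=True)
class Predicate:
    name: str
    args: Tuple[Union[str, "Predicate"], ...]

    def __str__(self) -> str:
        return f"{self.name}({', '.join(str(a) for a in self.args)})"

def clause_type(clause: List[Predicate]) -> str:
    """Classify a clause as 'single', 'multiple', or 'compound'."""
    # Assumption: an embedded predicate makes the whole clause compound.
    if any(isinstance(a, Predicate) for p in clause for a in p.args):
        return "compound"
    return "single" if len(clause) == 1 else "multiple"

single = [Predicate("space", ("Space",))]
multiple = [Predicate("space", ("Space",)),
            Predicate("area", ("Area",)),
            Predicate("has", ("Space", "Area"))]
# Illustrative quantity; the value and unit are invented for this sketch.
compound = [Predicate("greater_than_or_equal",
                      ("T", Predicate("quantity", ("1", "square_foot"))))]

rendered = ", ".join(str(p) for p in multiple)
```

Rendering a clause with `str` yields the same textual form as the examples in the paragraph above.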

Table 2 shows the patterns of the most applied SeM rules (i.e., rules applied at least three times) in the experiments. The patterns of the rest of the applied SeM rules are shown in Table 3.

Insert Table 2

Insert Table 3


The overall performance results of Experiment #1 and Experiment #2 are summarized in Table 4 and Table 5, respectively.

Insert Table 4

Insert Table 5

A comparison between the results of Experiment #1 and those of Experiment #2 is summarized in Table 6. The number of information tags in Experiment #2 increased by 400% relative to Experiment #1. The increase in the number of SeM rules was of similar magnitude (460%). Through analysis, the causes of this increase in the number of SeM rules were found to be: (1) the use of more information tags increases the length of patterns in SeM rules, which in turn increases the specificity of each pattern; and (2) the use of more information tags increases the complexity of patterns in SeM rules, which in turn increases the possible number of patterns. In contrast to SeM rules, the number of CoR rules decreased from Experiment #1 to Experiment #2. This results from the use of more information tags, which leads to more distinguishable information instances and, in turn, to fewer conflicts between information instances.

The algorithms achieved overall precision of 92.5% and 98.2%, recall of 95.1% and 99.1%, and F1 measure of 93.8% and 98.6% for Experiment #1 and Experiment #2, respectively. Both precision and recall improve in Experiment #2, because the use of more information tags could: (1) better distinguish and capture the variations in expressions; and (2) help define SeM rules with more specificity in patterns. Based on the comparative analysis, the following conclusion can be drawn: the use of more information tags helps in improving the performance of information transformation.
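The reported figures are mutually consistent, as the short Python check below illustrates. It recomputes the F1 measure from the rounded precision and recall values quoted above, and the percentage increase in SeM rules from the rule counts of the two experiments (53 and 297); the function names are chosen for this sketch, and the inputs are the rounded values from the text rather than the raw counts.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def percent_increase(old: float, new: float) -> float:
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100

f1_exp1 = round(f1(92.5, 95.1), 1)                     # Experiment #1
f1_exp2 = round(f1(98.2, 99.1), 1)                     # Experiment #2
sem_rule_increase = round(percent_increase(53, 297))   # SeM rules, #1 to #2
```

Here `f1_exp1` and `f1_exp2` evaluate to 93.8 and 98.6, and `sem_rule_increase` evaluates to 460, matching the values stated in the text.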

Insert Table 6


The precision values of relation logic clause elements are lower than the other precision and recall values across Experiment #1 and Experiment #2. Through analysis, four main causes for this relatively lower precision (89.8% and 97.5% for Experiment #1 and Experiment #2, respectively) of relation logic clause elements are recognized: (1) Structural ambiguity caused by conjunctive terms: For example, in the following part of a sentence, there are two possible syntactic uses of “and”, either linking “wall piers” and “such segments” or linking the preceding clause and the following clause: “…shear wall segments provide lateral support to the wall piers and such segments have a total stiffness…”. The ability of the SeM rules to handle structural ambiguity is limited by the development text, which may lead to errors; (2) Incorrect tagging during IE: For example, “professional” (in “registered design professional”) was incorrectly tagged as an adjective instead of a noun. This is due to the imperfection of state-of-the-art POS tagging methods; (3) Errors due to morphological analysis (MA): MA was used for improving the recall of semantic information instances by finding all forms of a term based on its lexical form. However, while useful in this regard, MA also introduced false positive instances. For example, as a result of MA, “supported” was stemmed into “support”, matched with the concept “support” in the ontology, and as a result incorrectly recognized as an instance of ‘subject’; and (4) Errors caused by certain SeM rules: For example, an SeM rule selects the immediate left neighbor of a preposition as the first argument of that preposition. In cases where the immediate left neighbor of a preposition is not its real first argument, this SeM rule causes errors. For example, in the following part of a sentence, “gypsum concrete” was mistakenly identified as the first argument rather than “clear span”: “clear span of the gypsum concrete between supports”.
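The fourth cause can be reproduced with a toy version of the left-neighbor heuristic. The Python sketch below is an illustration under stated assumptions: the phrase is pre-chunked by hand (with the article dropped), and the function name `left_neighbor_argument` is hypothetical rather than taken from the authors' rule set.

```python
from typing import List

def left_neighbor_argument(chunks: List[str], prep_index: int) -> str:
    """Heuristic: take the chunk immediately left of the preposition."""
    return chunks[prep_index - 1]

# Hand-chunked form of "clear span of the gypsum concrete between supports".
chunks = ["clear span", "of", "gypsum concrete", "between", "supports"]

# Correct for "of": its first argument is "clear span".
first_arg_of_of = left_neighbor_argument(chunks, chunks.index("of"))

# Incorrect for "between": the heuristic returns "gypsum concrete",
# although the intended first argument is "clear span".
first_arg_of_between = left_neighbor_argument(chunks, chunks.index("between"))
```

The heuristic is right for “of” and wrong for “between” in the same phrase, which shows why such a rule lowers the precision of relation logic clause elements without failing outright.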


Analyzing the remaining errors (other than those influencing the precision of relation logic clause elements), two additional causes of errors are recognized: (1) Missing tags in IE: For example, based on the concepts in the ontology, “connection” should have been semantically tagged as ‘subject’. However, in a few instances, it was missing the ‘subject’ information tag. This is due to the inherent errors in the NLP tools that were used (no existing NLP tool can achieve 100% performance); and (2) Errors in processing sentences with uncommon syntactic expression structures: For example, in the part of a sentence “…which have been water soaked for at least 24 hours…”, “soaked” (‘compliance checking attribute’) was not recognized because: (a) “soaked” was not semantically recognized because the ontology did not cover this concept, and (b) the syntactic feature of “soaked” (i.e., past participle) was not a common syntactic expression for ‘compliance checking attribute’ (in contrast, noun is a common expression for ‘compliance checking attribute’).

Limitations and Future Work

The experimental results show that the proposed approach is promising in automatically transforming the extracted information instances into logic clauses for further compliance reasoning. In spite of the high performance that was achieved (98.2%, 99.1%, and 98.6% for precision, recall, and F1 measure, respectively), three main limitations of this work are acknowledged, which the authors plan to address as part of their ongoing/future research. First, the methodology was only tested on processing quantitative requirements. The types of semantic patterns and conflicts in other types of requirements (e.g., existential requirements) may vary and, thus, may lead to different performance results. The processing of other types of requirements is expected to be no more complex than that of quantitative requirements, and thus is expected to have similar or better performance; nevertheless, in future work, the authors plan to test the proposed methodology on other types of requirements (e.g., existential requirements) for validation. Second, due to the large amount of manual effort required in developing a gold standard, the proposed ITr algorithms were tested only on one chapter of IBC 2009. Similar high performance is expected when testing on other chapters of IBC and on other regulatory documents, since all regulatory documents share similarities in expressions. However, different performance results might be obtained due to the possible variability of text across different chapters or different regulatory documents. As such, in future work, the authors plan to test the proposed ITr methodology on more chapters of IBC 2009 and on other types of regulatory documents (e.g., environmental regulations). Third, the validation of the proposed ITr algorithms was focused on precision and recall. At this stage, the computational efficiency of the proposed algorithms was not evaluated, although it was taken into consideration when developing the algorithms. For example, the more efficient and stable merge sort (rather than quick sort) was used when a sorting algorithm was needed. In future work, the authors plan to perform algorithm optimization to improve the computational efficiency of the proposed algorithms, if/as necessary.
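To illustrate what stability means for the merge sort mentioned above, the following Python sketch sorts information instance four-tuples by their starting position and keeps the original relative order of instances that start at the same position. Which data the authors actually sorted is not stated, so the input tuples, the sort key, and the function name `merge_sort` are assumptions for illustration.

```python
from typing import Callable, List, Tuple

FourTuple = Tuple[str, int, int, str]  # (instance, start, length, tag)

def merge_sort(items: List[FourTuple],
               key: Callable[[FourTuple], int]) -> List[FourTuple]:
    """Stable top-down merge sort that returns a new sorted list."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], key)
    right = merge_sort(items[mid:], key)
    merged: List[FourTuple] = []
    i = j = 0
    while i < len(left) and j < len(right):
        # '<=' takes from the left half on ties, which keeps the sort stable.
        if key(left[i]) <= key(right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

# Assumed input: two instances share the starting position 12.
instances = [("ventilation openings", 24, 20, "s"),
             ("net area", 12, 8, "a"),
             ("net", 12, 3, "JJ")]
ordered = merge_sort(instances, key=lambda t: t[1])
```

In this sketch, “net area” stays ahead of “net” because both start at position 12 and “net area” came first in the input; Python's built-in `sorted` gives the same ordering, since it is also stable.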

Contribution to the Body of Knowledge

This research contributes to the body of knowledge in four main ways. First, it offers domain-specific, semantic NLP-based information processing methods that can achieve full sentence processing and information extraction (i.e., all terms of a sentence are processed), as opposed to partial sentence processing and information extraction (i.e., only specific terms/concepts are processed/extracted). Domain-specific semantics allow for analyzing complex sentence structures that would otherwise be too complex and ambiguous for automated IE and ITr, recognizing domain-specific text meaning, and in turn allowing for processing/understandability of full sentences. Full sentence processing/understandability allows for a deeper level of NLP, namely natural language understanding. Second, this research shows that a hybrid approach that combines rule-based NLP methods and semantic NLP methods could achieve high performance for the combination of IE and ITr from/of regulatory text, in spite of the complexity inherent in natural language text. Domain-specific expert NLP knowledge (encoded in the form of rules), along with domain knowledge (represented in the form of an ontology), facilitates deep text processing/understandability. Previous work (Zhang and El-Gohary 2013c) showed high performance for rule-based, semantic IE. This paper further shows high performance for rule-based, semantic ITr. Third, a new context-aware and flexible way of utilizing pattern-matching-rule-based methods through the use of context-aware semantic mapping rules is offered. This way of utilizing pattern-matching-based rules captures the details (in terms of the expression, language structure, etc.) of complex sentence components, in a context-aware manner, and through flexible pattern lengths. Fourth, a new mechanism (“consume and generate” mechanism) for processing and transforming complex regulatory text into logic clauses is offered. The proposed mechanism follows the bottom-up method, which has been shown, based on the experimental results, to outperform the top-down method in ITr. The high performance that the mechanism achieved verifies that the bottom-up method is suitable for such ITr tasks.

From a practical perspective, this work is expected to have significant impacts on four main levels. First, this work facilitates ACC in the construction domain. ACC could reduce the time, cost, and errors of the checking process; promote compliance of construction projects with various regulations (due to easier and more frequent checking); and encourage the adoption of BIM in the AEC industry. Second, the novel IE and ITr methods and algorithms proposed in this work could be adopted/applied to automate a variety of other tasks in the construction domain, such as contract document analysis and construction accident record analysis. Third, the proposed ITr methodology could be adopted/applied outside of the construction domain, which would contribute to the general domain of natural language processing/understanding. Fourth, the results of this research could ultimately lead to defining principles for the drafting of future regulations in a manner that supports ACC. For example, the use of uncommon expressions that tend to cause processing errors could be avoided when drafting future regulations.

Conclusions

This paper presented a rule-based, semantic NLP methodology for automated information transformation (ITr) of information instances, which were automatically extracted from construction regulatory documents, into logic clauses. A set of semantic mapping (SeM) rules and conflict resolution (CoR) rules are used in ITr. CoR rules resolve conflicts between information instances, while SeM rules transform the information instances into logic clause elements. The SeM rules use context-aware and flexible information patterns. Both syntactic and semantic information tags are utilized in the patterns. Syntactic information tags (e.g., POS tags) are generated using NLP techniques. A semantic model helps recognize the semantic information tags of each extracted information instance. A “consume and generate” mechanism is proposed to handle complex sentence components and execute the SeM rules. The ITr method, thus, processes almost all terms of a sentence. Such full sentence processing enables deep NLP towards natural language understanding.

The proposed ITr algorithms were tested in transforming information instances of quantitative requirements, which were automatically extracted from Chapter 19 of IBC 2009, into logic clauses. The transformation results were compared with a manually developed gold standard. The results showed 98.2%, 99.1%, and 98.6% precision, recall, and F1 measure, respectively. This high performance shows that the proposed ITr methodology is promising. Through error analysis, the following six causes of errors were recognized: (1) missing tags in IE; (2) incorrect tagging during IE; (3) errors in processing sentences with uncommon expression structures; (4) errors due to morphological analysis; (5) errors caused by certain SeM rules; and (6) structural ambiguity. In future work, the authors plan to further refine the proposed methodology to avoid those causes of errors as much as possible, in an effort to further enhance the performance of the ITr algorithms. Also, as part of the authors’ ongoing/future research, the proposed ITr methodology will be tested on more chapters of building codes and on other types of construction regulatory documents (e.g., environmental regulations). Similar high performance is expected. However, variability in performance is possible due to differences in the characteristics of the text across different chapters or documents.

Acknowledgement

The authors would like to thank the National Science Foundation (NSF). This material is based upon work supported by NSF under Grant No. 1201170. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF.

References

Abney, S. (1997). “Part-of-speech tagging and partial parsing.” Text, Speech and Language Technology, 2(1997), 118-136.

Avolve Software Corporation. (2013). Electronic plan review for building and planning departments. <http://www.avolvesoftware.com/index.php/solutions/building-departments/> (Oct. 3, 2013).

Breaux, T.D., and Anton, A.I. (2008). “Analyzing regulatory rules for privacy and security requirements.” IEEE Transactions on Software Eng., 34(1), 5-20.

Caldas, C.H., and Soibelman, L. (2003). “Automating hierarchical document classification for construction management information systems.” Autom. Constr., 12(2003), 395-406.

Califf, M. E., and Mooney, R. J. (2003). “Bottom-up relational learning of pattern matching rules for information extraction.” J. Machine Learning Research, 4(2003), 177-210.

Cherpas, C. (1992). “Natural language processing, pragmatics, and verbal behavior.” The Analysis of Verbal Behavior, 10, 135-147.

Crowston, K., Liu, X., Allen, E., and Heckman, R. (2010). “Machine learning and rule-based automated coding of qualitative data.” Proc., 73rd ASIS&T Annual Meeting: Navigating Streams in an Information Ecosystem, Association for Information Science and Technology, Silver Spring, Maryland, 1-2.

Ding, L., Drogemuller, R., Rosenman, M., Marchant, D., and Gero, J. (2006). “Automating code checking for building designs – DesignCheck.” Clients Driving Innovation: Moving Ideas into Practice, CRC for Construction Innovation, Brisbane, Australia, 1-16.

El-Gohary, N.M., and El-Diraby, T.E. (2010). “Domain ontology for processes in infrastructure and construction.” J. Constr. Eng. Manage., 136(7), 730–744.

Fenves, S.J., Gaylord, E.H., and Goel, S.K. (1969). “Decision table formulation of the 1969 AISC specification.” Civ. Eng. Studies: Structural Research Series, 347, University of Illinois, Urbana, IL, 1-167.

Garrett, J.H., Jr., and Fenves, S.J. (1987). “A knowledge-based standard processor for structural component design.” Eng. with Comput., 2(4), 219-238.

Gildea, D., and Jurafsky, D. (2002). “Automatic labeling of semantic roles.” J. Comput. Linguist., 28(3), 245-288.

Goh, O. S., Depickere, A., Fung, C.C., and Wong, K. W. (2006). “Topdown natural language query approach for embodied conversational agent.” Proc., Intl. MultiConf. Eng. and Comput. Sci. (IMECS 2006), The International Association of Engineers (IAENG), Hong Kong, China.

Gruber, T.R. (1995). “Toward principles for the design of ontologies used for knowledge sharing.” Intl. J. Human-Computer Studies, 43, 907-928.

Han, C.S., Kunz, J.C., and Law, K.H. (1998). “Client/server framework for online building code checking.” J. Comput. Civ. Eng., 12(4), 181-194.

International Code Council (ICC). (2012). “International Code Council.” AEC3, <http://www.aec3.com/en/5/5_013_ICC.htm> (Oct. 26, 2013).

Khemlani, L. (2005). “CORENET e-PlanCheck: Singapore's automated code checking system.” AECbytes “Building the Future” Article, <http://www.aecbytes.com/buildingthefuture/2005/CORENETePlanCheck.html> (Oct. 26, 2013).

Kiyavitskaya, N., Zeni, N., Breaux, T.D., Anton, A.I., Cordy, J.R., Mich, L., and Mylopoulos, J. (2008). “Automating the extraction of rights and obligations for regulatory compliance.” Lecture Notes in Comput. Sci., 5231(2008), 154-168.

Marquez, L. (2000). “Machine learning and natural language processing.” Proc., “Aprendizaje automatico aplicado al procesamiento del lenguaje natural”.

Nguyen, T. (2005). “Integrating building code compliance checking into a 3D CAD system.” Proc., Intl. Conf. Comput. Civ. Eng., ASCE, Reston, VA, 1-12.

Niemeijer, R.A., Vries, B. de, and Beetz, J. (2009). “Check-mate: automatic constraint checking of IFC models.” Managing IT in Construction/Managing Construction for Tomorrow, A Dikbas, E Ergen & H Giritli (Eds.), CRC Press, London, 479-486.

Pocas Martins, J.P., and Abrantes, V. (2010). “Automated code-checking as a driver of BIM adoption.” Intl. J. Housing Sci., 34(4), 286-294.

Pradhan, S., Ward, W., Hacioglu, K., Martin, J.H., and Jurafsky, D. (2004). “Shallow semantic parsing using support vector machines.” Proc., NAACL-HLT, The Association for Computational Linguistics, East Stroudsburg, PA, 233-240.

Roth, D., and Yih, W. (2004). “A linear programming formulation for global inference in natural language tasks.” Proc., 2004 Conf. Comput. Natural Language Learning (CoNLL-2004), SIGNLL, Boston, MA, 1-8.

Saint-Dizier, P. (1994). “Advanced logic programming for language processing.” Academic Press, San Diego, CA.

Salama, D., and El-Gohary, N. (2013a). “Semantic text classification for supporting automated compliance checking in construction.” J. Comput. Civ. Eng., Accepted and published online ahead of print.

Salama, D., and El-Gohary, N. (2013b). “Automated compliance checking of construction operation plans using a deontology for the construction domain.” J. Comput. Civ. Eng., 27(6), 681-698.

Soysal, E., Cicekli, I., and Baykal, N. (2010). “Design and evaluation of an ontology based information extraction system for radiological reports.” Comput. in Biology and Med., 40(11-12), 900-911.

Sterling, L., and Shapiro, E. (1986). “The art of Prolog: advanced programming techniques.” MIT Press, Cambridge, Massachusetts, London, England.

Tan, X., Hammad, A., and Fazio, P. (2010). “Automated code compliance checking for building envelope design.” J. Comput. Civ. Eng., 24(2), 203-211.

Tierney, P.J. (2012). “A qualitative analysis framework using natural language processing and graph theory.” The Intl. Review of Research in Open and Distance Learning, 13(5).

University of Sheffield. (2013). “General architecture for text engineering.” <http://gate.ac.uk/> (Oct. 13, 2013).

Wyner, A., and Governatori, G. (2013). “A study on translating regulatory rules from natural language to defeasible logic.” Proc., RuleML 2013: The 7th Intl. Web Rule Symposium, Springer-Verlag, Berlin Heidelberg, Germany.

Wyner, A., and Peters, W. (2011). “On rule extraction from regulations.” Proc., JURIX 2011: The 24th Intl. Conf. Legal Knowledge and Info. Systems, IOS Press, Amsterdam, The Netherlands, 113-122.

Yin, S., and Fan, G. (2013). “Research of POS tagging rules mining algorithm.” Applied Mechanics and Materials, 347-350(2013), 2836-2840.

Zhang, J., and El-Gohary, N.M. (2013a). “Information transformation and automated reasoning for automated compliance checking in construction.” Proc., 2013 ASCE Intl. Workshop Comput. in Civ. Eng., ASCE, Reston, VA, 701-708.

Zhang, J., and El-Gohary, N.M. (2013b). “Handling sentence complexity in information extraction for automated compliance checking in construction.” Proc., CIB W78 2013, Conseil International du Bâtiment (CIB), Rotterdam, The Netherlands.


Zhang, J., and El-Gohary, N. (2013c). “Semantic NLP-based information extraction from construction regulatory documents for automated compliance checking.” J. Comput. Civ. Eng., Accepted and published online ahead of print.

Zhong, B.T., Ding, L.Y., Luo, H.B., Zhou, Y., Hu, Y.Z., and Hu, H.M. (2012). “Ontology-based semantic modeling of regulation constraint for automated construction quality compliance checking.” Autom. Constr., 28, 58-70.

Zhou, N. (2012). “B-Prolog user’s manual (version 7.7): Prolog, agent, and constraint programming.” Afany Software. <http://www.probp.com/manual/manual.html> (Nov. 19, 2012).

Zouaq, A. (2011). “An overview of shallow and deep natural language processing for ontology learning.” Ontology Learning and Knowledge Discovery Using the Web: Challenges and Recent Advances, IGI Global, Hershey, PA, 16-38.


Tables

Table 1: A Transformation Example

Requirement sentence: Courts shall not be less than 3 feet in width.

| | Subject | Compliance Checking Attribute | Comparative Relation | Quantity Value | Quantity Unit | Quantity Reference |
|---|---|---|---|---|---|---|
| Source – Information Tag | Subject | Compliance Checking Attribute | Comparative Relation | Quantity Value | Quantity Unit | Quantity Reference |
| Source – Information Instance | court | width | not less than | 3 | feet | |

Target – Logic Clause:

compliant_width_of_court(Court) :- width(Width), court(Court), has(Court,Width), greater_than_or_equal(Width,quantity(3,feet)).
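The mapping in Table 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_clause` and the `COMPARATORS` lookup are hypothetical, and only the "not less than" entry is taken from the table. The sketch simply assembles the clause string from the tagged information instances.

```python
# Hypothetical lookup from comparative relations to predicate names;
# only the "not less than" entry comes from Table 1.
COMPARATORS = {"not less than": "greater_than_or_equal"}


def build_clause(subject, attribute, relation, value, unit):
    """Assemble a Prolog-style clause string from one tagged requirement."""
    subj_var = subject.capitalize()    # upper case marks a logic variable
    attr_var = attribute.capitalize()
    head = f"compliant_{attribute}_of_{subject}({subj_var})"
    body = [
        f"{attribute}({attr_var})",
        f"{subject}({subj_var})",
        f"has({subj_var},{attr_var})",
        f"{COMPARATORS[relation]}({attr_var},quantity({value},{unit}))",
    ]
    return f"{head} :- {', '.join(body)}."


clause = build_clause("court", "width", "not less than", 3, "feet")
print(clause)
```

Under these assumptions, the printed string reproduces the target logic clause of Table 1.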

Table 2: Patterns of the Most Applied SeM Rules in the Experiments

| SeM Rule Pattern | Action | Condition Case | Logic Clause Generated | SeM Rule Type |
|---|---|---|---|---|
| [‘a’ ‘s’ ‘cr’] (a) ‘OF’ (b) [‘a’ ‘s’ ‘cr’] (c) | | | a(A),c(C),has(C,A) | Simple |
| ‘dpvr’ (a) [‘s’ ‘cr’] (b) | look-back search for attribute or subject (s); look-back search for negation (n) | n exists | s(S),b(B),not a(S,B) | Complex |
| | | n not exists | s(S),b(B),a(S,B) | |
| ‘c’ (a) ‘v’ (b) | look-back search for attribute or subject (s); look-ahead search for unit or reference (u); look-back search for negation (n) | n exists | not a(S, quantity(b,u)) | Complex |
| | | n not exists | a(S, quantity(b,u)) | |
| ‘I’ ‘s’ | skip | | | Multiple action |
| ‘c’ (a) ‘v’ (b) ‘u’ (c) ‘IN’ (d) ‘s’ (e) | look-back search for attribute or subject (s) | | distance(Distance),s(S),e(E),d(S,E,Distance),a(Distance,quantity(b,c)) | Multiple action |
| [‘a’ ‘s’ ‘cr’] (a) ‘CC’ (b) [‘a’ ‘s’ ‘cr’] (c) | | | (a(A);c(A)) | Simple |
| [‘VB’ ^ ‘be’] (a) ‘IN’ (b) [‘cr’ ‘a’ ‘s’] (c) | look-back search for subject or attribute (s) | | s(S),c(C),b(S,C) | Multiple action |
| [‘a’ ‘s’ ‘cr’] (a) ‘IN’ (b) [‘a’ ‘s’ ‘cr’] (c) | | | a(A),c(C),b(A,C) | Simple |
| ‘Except’ | mark the beginning of exception | | | Multiple action |
| ‘n’ (a) ‘c’ (b) ‘v’ (c) ‘u’ (d) | look-back search for attribute or subject (s) | | s(S),not b(S,quantity(c,d)) | Multiple action |
| [‘a’ ‘s’] (a) ‘OF’ (b) ‘v’ (c) [‘u’ ‘a’] (d) | | pattern preceded by [‘a’ ‘s’ ‘cr’] (e) [‘Has’ ‘NoHas’ ‘IN’ ‘OF’ ^ ‘between’] (f) | a(A),e(E),equal_to(E,quantity(c,d)) | Multiple condition |
| | | otherwise | a(A),equal_to(A,quantity(c,d)) | |
| ‘VBP’ (a) ‘VBN’ (b) | look-back search for attribute or subject (s) | | b(S) | Multiple action |
| ‘I’ ‘CC’ | skip | | | Multiple action |
| ‘s’ (a) ‘MD’ (b) ‘Has’ | look-back search for attribute or subject (s) | pattern preceded by ‘IN’ | s(S),d(D),has(S,D) | Complex |
| | | otherwise | a(A),d(D),has(A,D) | |
| ‘TO’ (a) ‘VB’ (b) [‘s’ ‘cr’ ‘a’] (c) | look-back search for attribute or subject (s) | s not exists | c(C),a_b(C) | Complex |

(1) ‘’: A pair of single quotes encloses information tags

(2) ^: A caret separates optional information tags from exceptions

(3) (a), (b), (c), etc., show the mapping of components (in SeM patterns) to logic clause elements (in generated logic clauses), where an upper case represents a variable

(4) Contents in the “logic clause generated” column are case-sensitive
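To make the pattern notation concrete, the following Python sketch applies the first rule of Table 2, [‘a’ ‘s’ ‘cr’] (a) ‘OF’ (b) [‘a’ ‘s’ ‘cr’] (c), to a list of (tag, instance) tuples. It is a simplified single-pass reading, not the algorithm of Figure 9. The names `PATTERN`, `match_at`, `generate`, and `consume_and_generate` are hypothetical, as is the tuple representation of the information instances.

```python
# Pattern for the first rule in Table 2: each position lists the accepted tags.
PATTERN = [{"a", "s", "cr"}, {"OF"}, {"a", "s", "cr"}]


def match_at(tuples, start, pattern=PATTERN):
    """Return the matched (tag, instance) slice, or None if the pattern fails."""
    window = tuples[start:start + len(pattern)]
    if len(window) < len(pattern):
        return None
    if all(tag in allowed for (tag, _), allowed in zip(window, pattern)):
        return window
    return None


def generate(window):
    """Emit a(A),c(C),has(C,A) for components (a) and (c) of the match."""
    a, c = window[0][1], window[2][1]
    var_a, var_c = a.capitalize(), c.capitalize()  # upper case marks a variable
    return f"{a}({var_a}),{c}({var_c}),has({var_c},{var_a})"


def consume_and_generate(tuples):
    """Scan left to right; consume matched tuples and collect generated elements."""
    out, i = [], 0
    while i < len(tuples):
        window = match_at(tuples, i)
        if window:
            out.append(generate(window))
            i += len(window)  # consume the matched tuples
        else:
            i += 1            # no rule fired at this position
    return out


# Hypothetical tagged fragment: "width of court"
elements = consume_and_generate([("a", "width"), ("OF", "of"), ("s", "court")])
print(elements)
```

A fuller version would hold many patterns, perform the look-back and look-ahead searches listed in the Action column, and branch on the Condition Case column. Those steps are omitted here.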

Table 3: Patterns of the Rest of the SeM Rules Applied in the Experiments

SeM Rule Pattern

[‘a’ ‘s’ ‘cr’] ‘MD’ ‘n’ ‘VB’ ‘c’ ‘v’ ‘u’

‘VBP’ ‘dpvr’ ‘VB’

‘s’ ‘JJ’ ‘n’ ‘c’ ‘v’ ‘u’

‘n’ ‘c’ ‘s’

‘IN’ ‘ea’ [‘v’ ‘CD’] ‘u’ ‘OF’ ‘s’

[‘s’ ‘cr’] ‘VBD’ [‘cr’ ‘s’]

‘I’ ‘CC’ ‘n’ ‘C’ ‘v’ ‘u’

‘IN’ ‘VBG’ [‘cr’ ‘s’]

‘JJ’ ‘IN’ ‘c’ ‘v’ [‘u’ ‘cr’]

[‘s’ ‘cr’] ‘VBP’ [‘VBN’ ‘JJ’]

‘VB’ ‘IN’ ‘c’ ‘v’ [‘cr’ ‘s’]

‘dpvr’ ‘v’ ‘u’

‘s’ ‘MD’ ‘VB’ ‘dpvr’ [‘VBZ’ ‘cr’ ‘VB’]

‘RB’ ‘TO’ [‘s’ ‘cr’]

‘CC’ ‘v’ ‘u’ ‘IN’ ‘a’

‘MD’ ‘VB’ ‘VBN’

‘TO’ [‘s’ ‘cr’]

‘a’ ‘OF’ ‘v’ ‘u’ ‘by’ ‘v’ ‘u’

‘s’ ‘MD’ ‘n’ ‘VB’ ‘dpvr’

[‘cr’ ‘s’ ‘a’] [‘OF’ ‘IN’ ‘Has’ ‘NoHas’ ^ ‘for’] ‘s’ ‘IN’ ‘s’

[‘s’ ‘a’ ‘cr’] ‘I’ ‘VBG’ [‘cr’ ‘a’ ‘s’] ‘I’

‘MD’ ‘VB’ [‘a’ ‘s’ ‘cr’]

‘JJ’ ‘CC’ ‘JJR’ ‘s’

‘n’ ‘c’ ‘v’

‘s’ ‘WDT’ ‘VBP’ ‘cr’

‘n’ ‘c’ ‘CD’

‘VBG’ ‘cr’ ‘VBP’ ‘VBN’

‘v’ [‘s’ ‘cr’]

‘MD’ ‘VB’ ‘v’ ‘u’

‘s’ ‘VBN’

‘c’ ‘v’ ‘ea’ [‘cr’ ‘s’]

‘JJR’ ‘IN’

‘IN’ ‘JJ’ ‘CC’ ‘s’

‘TO’ [‘s’ ‘cs’]

[‘s’ ‘cr’] ‘with’ ‘a’

‘Except’ ‘IN’

‘n’ ‘c’ ‘v’ [‘cr’ ‘s’]

‘rv’ [‘a’]

‘JJR’ ‘IN’ ‘v’ ‘u’

‘VBZ’ ‘dpvr’

‘s’ ‘Has’ ‘a’ ‘OF’ ‘c’ ‘v’ ‘u’

‘VB’ [‘cr’ ‘a’ ‘s’]

‘s’ ‘MD’ ‘VB’ ‘OF’

‘IN’ [‘cr’ ‘a’ ‘s’]

‘MD’ ‘VB’ ‘dpvr’ ‘s’

[‘u’ ‘JJR’] [^ ‘stories’]

[‘cr’ ‘a’ ‘s’] ‘MD’ ‘VB’ [‘cr’ ‘a’ ‘s’]

‘I’ ‘a’

‘s’ ‘MD’ ‘Has’ ‘s’

‘I’ ‘VBD’

‘cs’ ‘MD’ ‘Has’ ‘s’

‘I’ ‘JJ’

‘v’ ‘u’ ‘CC’ ‘JJR’

‘VBD’ ‘I’

‘s’ ‘MD’ ‘VB’ ‘dpvr’

Table 4: Experimental Results Using Essential Information Tags Only

| | Concepts | Relations | Total |
|---|---|---|---|
| Number of logic clause elements in gold standard | 334 | 749 | 1083 |
| Total number of logic clause elements generated | 328 | 786 | 1114 |
| Number of logic clause elements correctly generated | 324 | 706 | 1030 |
| Precision | 0.988 | 0.898 | 0.925 |
| Recall | 0.970 | 0.943 | 0.951 |
| F1 measure | 0.979 | 0.920 | 0.938 |

Table 5: Experimental Results Using Both Essential and Secondary Information Tags

| | Concepts | Relations | Total |
|---|---|---|---|
| Number of logic clause elements in gold standard | 570 | 1349 | 1919 |
| Total number of logic clause elements generated | 569 | 1367 | 1936 |
| Number of logic clause elements correctly generated | 568 | 1333 | 1901 |
| Precision | 0.998 | 0.975 | 0.982 |
| Recall | 0.996 | 0.988 | 0.991 |
| F1 measure | 0.997 | 0.982 | 0.986 |
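The precision, recall, and F1 values in Tables 4 and 5 follow from the three element counts in each column. The short Python sketch below recomputes the "Total" columns. The function name `prf` is hypothetical, and rounding to three decimals is assumed to match the tables.

```python
def prf(gold, generated, correct):
    """Precision, recall, and F1 from logic clause element counts."""
    precision = correct / generated   # correct share of generated elements
    recall = correct / gold           # correct share of gold-standard elements
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 3), round(recall, 3), round(f1, 3)


table4_total = prf(gold=1083, generated=1114, correct=1030)
table5_total = prf(gold=1919, generated=1936, correct=1901)
print(table4_total, table5_total)
```

The same call with the "Concepts" or "Relations" counts gives the remaining columns.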

Table 6: Comparative Summary of Experiment #1 and Experiment #2

| | Experiment #1 | Experiment #2 | Increase |
|---|---|---|---|
| Number of information tags used | | | + 400% |
| Number of semantic mapping rules used | | 297 | + 460% |
| Number of conflict resolution rules used | | | - 18% |
| Number of logic clause elements built | 1114 | 1936 | 174% |
| Precision | 0.925 | 0.982 | |
| Recall | 0.951 | 0.991 | |
| F1 Measure | 0.938 | 0.986 | |


Figures

Figure 1. Proposed approach for automated rule extraction

Figure 2. Proposed information transformation methodology

Figure 3. Sample sentence with information tags

Figure 4. Illustration of top-down method and bottom-up method

Figure 5. Example illustrating logic-based information representation and reasoning

Figure 6. Semantic information tags

Figure 7. Example illustrating the processing of a sample sentence: (a) original sentence; (b) sentence tagged with information tags; (c) information instance tuple list; (d) information instance tuple list after applying conflict resolution rules; (e) logic clause generated by consume and generate mechanism

Figure 8. Pseudocode for main algorithm

Figure 9. Pseudocode for consume and generate mechanism


