Computer-Aided Comparison of Thesauri Extracted from
Complementary Patent Classes as a Means to Identify Relevant
Field Parameters
G. Cascini and M. Zini
Abstract Patents are gaining growing importance as a
complementary source of technical information, since most of
the information they disclose is not accessible in scientific
and technical literature. Text mining technologies are emerging
as a possible solution to increase the efficiency of patent
analysis activities; however, most of the existing systems are
derived from general-purpose applications that only marginally
leverage the peculiarities of patents. The authors are
developing algorithms and tools fully dedicated to patent
mining, i.e. information extraction from patent literature. The
present paper aims at the identification of relevant technical
parameters for a certain domain, through the comparison of
thesauri automatically extracted from the given field of
application and from its complementary patent classes.
Keywords Patent mining · Field thesaurus · Patent classification ·
OTSM-TRIZ · Network of parameters · Network of evolutionary trends
G. Cascini, M. Zini
Dipartimento di Meccanica, Politecnico di Milano, 20156 Milan, Italy
e-mail: gaetano.cascini@polimi.it
1 Introduction
The definition of competitive R&D strategies requires
monitoring the evolution of technical systems in order to assess
the maturity level of current solutions and to check the
emergence of new technologies. Nevertheless, although more than
fifty methodologies with different characteristics and specific
purposes have been proposed so far in this field [1], no
universal method is known. Rather, complementary instruments
must be integrated according to the specific goal and data
availability. Moreover, due to the huge amount of scientific and
technical documentation nowadays produced in any field of
application, these analyses are extremely time consuming.
Among the recent research developments which deserve
attention for improving the efficiency of innovation-related
activities, TRIZ (the Theory of Inventive Problem Solving) is
gaining popularity as a means to systematize the analysis of a
technical system and to identify opportunities of evolution.
A promising direction of research in this area is the
definition of structured models representing in a concise format
the challenges of a certain field of application, in the form of
networks of contradictions as proposed in [2], or in the form
of networks of evolutionary trends, as described in [3].
In both cases the domain knowledge is represented in terms of
parameters and relationships between these parameters; the
identification of the relevant parameters and the related links
is a critical task which requires questionnaires addressed to
subject meta-experts, aimed at making their knowledge explicit.
A complementary and extremely valuable source of information
to support these analyses is constituted by patent databases:
in fact, several studies have demonstrated that 80% of the
information contained in patents is not available in any other
source [4]. A further advantage of patents is related to their
semi-structured format, which allows adopting customized
text-mining techniques to improve information extraction
efficiency.
The authors are developing a set of complementary algorithms
for patent mining. The goal of the present article is to show
the preliminary results of a research activity aimed at building
a model of domain knowledge in the form of a network of
parameters. In more detail, the paper describes the process of
construction of a domain thesaurus through automatic patent
analysis and the criteria to compare thesauri extracted from
complementary patent classes as a means to identify relevant
field parameters.
A. Bernard (ed.), Global Product Development,
DOI 10.1007/978-3-642-15973-2_56, © Springer-Verlag Berlin Heidelberg 2011
2 Related Art
Before detailing the methodological approach and the
computer-based system proposed in the present paper, it is
worth recalling some fundamentals of the International Patent
Classification. This section also summarizes some relevant
outcomes of previous research activities carried out by the
authors in the field of patent text mining. Finally, previous
works related to automatic thesaurus construction are critically
surveyed, to highlight opportunities and limits for their
application in the patent field.
2.1 International Patent Classification
The International Patent Classification (IPC) system is a
language-independent hierarchical classification of patents and
utility models according to the different areas of technology
to which they pertain.
Inventions from any field are classified into 8 sections (A to H)
and further subdivided into classes, subclasses, main groups and
subgroups (5th and lower levels).
The primary purpose of the IPC is to support the retrieval of
patent documents, in order to establish the novelty and evaluate
the inventive step or non-obviousness of technical disclosures
in patent applications. The Classification, furthermore, has the
important purposes of serving as [5]:
i. an instrument for the orderly arrangement of patent documents
in order to facilitate access to the technological and legal
information contained therein;
ii. a basis for selective dissemination of information to all
users of patent information;
iii. a basis for investigating the state of the art in given fields
of technology;
iv. a basis for the preparation of industrial property statistics
which in turn permit the assessment of technological
development in various areas.
According to the third objective of the IPC, it is assumed
that the patents belonging to a specific class constitute a
meaningful sample of documents from which to extract the
terminology of a certain field of application and the main
technical parameters of such a technological area.
2.2 The PAT-Analyzer Project
The authors are working on the development of new techniques
and algorithms for patent analysis and comparison [6–8]. As a
result of these previous experiences a prototype software system
(named PatAnalyzer) has been developed with the following
functionalities:
• identify the components of the invention;
• classify the identified components in terms of
detail/abstraction level and their compositional relationships
in terms of supersystem/subsystem links;
• identify positional and functional interactions between the
components both internal and external to the system;
• identify the most relevant components of each patent for
a given project according to a ranking criterion which
combines the detail level of the description with components’
occurrences in patent claims and with the Inverse Document
Frequency, i.e. the “rarity” of each synset of the Thesaurus.
2.3 Automatic Thesaurus Construction
The word thesaurus derives from Greek and Latin and means
“treasury or storehouse; hence, a repository, especially of
knowledge; often applied to a comprehensive work, like a
dictionary or encyclopedia”. Numerous definitions of thesauri
exist across fields such as computer science, artificial
intelligence and library and information science [9–11].
They vary from quite modest definitions that do not specify
types of conceptual relations, to more specific definitions
that clearly define the conceptual relations. In [12] there is
an example of a modest definition: “we define a thesaurus
as simply a mapping from words to other closely related
words”. In contrast, Miller gives a more elaborate definition
of a thesaurus as “a lexical-semantic model of a conceptual
reality or its constituent, which is expressed in the form
of a system of terms and their relations, offers access via
multiple aspects and is used as a processing and searching
tool of an information retrieval unit” [13]. The ISO 2788:1986
standard (Guidelines for the establishment and development of
monolingual thesauri) defines a thesaurus as a “vocabulary of a
controlled indexing language, formally organized so that a
priori relationships between concepts (for example as ‘broader’
and ‘narrower’) are made explicit”. For the purpose of this
work, the broader definition is adopted: a thesaurus is defined
as a structured system of concepts identified by collections of
terms and hierarchical relationships between these concepts.
Manual thesaurus construction is a huge, time-consuming
task of term selection, conceptual analysis and relational
structuring of concepts and terms [10]; moreover, it is
subject to problems of bias, inconsistency and limited
coverage.
In addition, thesaurus compilers cannot keep up with
constantly evolving language use and cannot afford to build new
thesauri for the many sub-domains that NLP techniques are
being applied to.
There is a clear need for methods to extract thesauri
automatically or for tools that assist in the manual creation
and updating of these semantic resources.
Methods for automatic thesaurus extraction can be roughly
divided into two categories: statistical methods and linguistic
pattern methods.
Statistical methods rely on the observation that semantically
related terms will appear in similar contexts. These systems
differ primarily in their definition of context (e.g. window of
text, sentence, paragraph, grammatical context, entire document)
and in the way they calculate similarity from the contexts in
which each term appears [14]. The simplest contexts to extract
are the words surrounding a term up to some fixed distance. Some
approaches take the whole document as the context and consider
term co-occurrence at the document level.
In [15] grammatical relations are extracted such as:
• term is the subject of a verb;
• term is the (direct/indirect) object of a verb;
• term is modified by a noun or adjective;
• term is modified by a prepositional phrase.
The relations for each term are then collected and counted,
producing a context vector for each term. Once these contexts
have been defined, these systems define measures of similarity
between context vectors and then use clustering or nearest
neighbor methods to find related terms.
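
As a minimal sketch of the statistical approach, the snippet below builds window-based context vectors and compares them with cosine similarity. The window size, the toy sentence and the function names `context_vectors` and `cosine` are assumptions made for illustration; they are not taken from the cited systems, which use richer (e.g. grammatical) contexts.

```python
from collections import Counter
from math import sqrt

def context_vectors(tokens, window=2):
    """Map each token to a Counter of the words seen within `window` positions."""
    vectors = {}
    for i, term in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        vectors.setdefault(term, Counter()).update(context)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy sentence, invented for illustration only
tokens = "the coil moves the core and the solenoid moves the core".split()
vectors = context_vectors(tokens)
sim_coil_solenoid = cosine(vectors["coil"], vectors["solenoid"])
sim_coil_and = cosine(vectors["coil"], vectors["and"])
```

In this toy example "coil" and "solenoid" share more of their context than "coil" and "and", so the first similarity is the higher of the two; a nearest-neighbor search over such scores would then propose them as related terms.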
Linguistic pattern methods are based on the observation
that patterns of co-occurring terms carry information about
their semantic relations. These systems extract related terms
directly by recognizing linguistic patterns which connect
synonyms and hyponyms [16,17]. In the pioneering work
of Hearst [16], the use of linguistic patterns was suggested to
discover hyponymy relations from unstructured text.
Patterns like
such NP as {NP ,}* {(or | and)} NP
as in: “Works by such authors as Herrick, Goldsmith, and
Shakespeare”, or like
NP {, NP}* {,} or other NP
as in: “Bruises, wounds, broken bones or other injuries”
can be used to extract hyponymy relations. From the examples
it is possible to infer that “Herrick”, “Goldsmith” and
“Shakespeare” are all hyponyms of the term “author” and that
“bruise”, “wound” and “broken bone” are hyponyms of “injury”.
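
The first pattern can be approximated with a regular expression. This is a rough sketch under a strong assumption: each noun phrase is reduced to a single word, whereas a real system would rely on a chunker or parser to find NP boundaries and on a lemmatizer to map “authors” to “author”. The name `hyponyms_such_as` is hypothetical.

```python
import re

# Assumption: an NP is approximated by one word (\w+).
SUCH_AS = re.compile(r"such (\w+) as ((?:\w+, )*\w+,? (?:or|and) \w+)")

def hyponyms_such_as(sentence):
    """Return (hyponym, hypernym) pairs matched by 'such NP as NP, ..., and NP'."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        # Split the enumeration on ", " and on the final "and" / "or"
        items = re.split(r",? (?:or|and) |, ", match.group(2))
        pairs.extend((item, hypernym) for item in items if item)
    return pairs

pairs = hyponyms_such_as(
    "Works by such authors as Herrick, Goldsmith, and Shakespeare"
)
```

No lemmatization is applied here, so the hypernym is returned in the surface form found in the sentence.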
The methods described above follow a general-purpose
approach, relying only on plain text, without any other
information available on the ontological structure of the
concepts to be extracted and organized.
In the work by Shinzato [18], itemizations in HTML documents
taken from the Web are exploited to identify hyponym
candidate sets; statistical measures and heuristics are then
used to select actual hyponyms.
This last work suggests that, where available, the information
conveyed by the peculiar structure of the analyzed document can
be exploited.
The approach proposed in this work leverages patent
structure and text patterns to provide a semi-automated
thesaurus generation system.
Focusing on the denominations of invention components as the
thesaurus terms, it is possible to exploit the semantic
information conveyed by alternative denominations sets, as
defined in Sect. 3.1, to discover synonymy and hyponymy
relations. The proposed approach is described in the following
section.
3 Thesaurus Construction and Comparison
The present section is subdivided into two subsections: the
first focuses on the original algorithm developed by the authors
for computer-aided thesaurus construction; the second details
the proposed procedure to compare thesauri extracted from
complementary patent classes with the aim of identifying the
main technical parameters of their related fields.
In more detail, with the aim of building a model of
domain knowledge according to any of the approaches
described in [2] and in [3], it is necessary to identify two
different kinds of parameters:
• Evaluation Parameters, i.e. parameters to measure the
level of satisfaction of system requirements;
• Control Parameters, i.e. any kind of design variable,
property or feature controllable by the designer, which might
impact on at least one Evaluation Parameter.
Control Parameters and Evaluation Parameters related to a
specific technical field will be referred to as domain
parameters hereafter in the paper.
3.1 Semi-Automated Thesaurus Construction
Most text mining systems applied to patent analysis suffer
from the influence of the language style and the terminology of
the writer; in other terms, when different inventors adopt
different terms or expressions to describe the same components
and functions, existing text mining applications are rarely able
to identify the existing semantic link between these concepts.
As described in [6], the authors identify the components of
an invention by means of their reference characters, according
to the universal patent writing rule which states that “the
same part of an invention appearing in more than one view
of the drawing must always be designated by the same reference
character, and the same reference character must never
be used to designate different parts” [19].
According to this rule, different denominations associated
with the same part must be semantically related, at least within
the given patent text. However, when comparing two different
patents, it is necessary to identify whether component x of
patent X is to be considered the same as component y of patent Y.
Chances are that the two components have different names in
different patents while referring to the same type of object.
In order to be able to compare components between different
patents it is required to build a component denominations
thesaurus, which defines concepts as sets of synonyms
(synsets) and hierarchical semantic relationships (hyponymy
and hypernymy) between those concepts. The proposed
approach to semi-automatically build such a thesaurus leverages
the extracted components and their alternative denominations,
which are then processed through a heuristic of text
patterns to identify synonymy and hyponymy relationships.
3.1.1 Alternative Denominations
In order to provide an unambiguous description of the
algorithm for thesaurus construction it is helpful to introduce
some formal definitions.
Denomination $d_k$ of a component $k$ is a word, or a set of
words, that denotes the component $k$ in the patent text.
Alternative denominations set $A_{k,p} = \{d_1, d_2, \ldots, d_n\}$
of a component $k$ is defined as the set of $n$ denominations
referring to the same component $k$ within the patent $p$
(an exemplary set of alternative denominations for a
few components of patent US 5.328.488 is shown in
Table 1).
The set of all the denominations sets extracted from every
component and every patent in a given invention set $I$ is
referred to as $A_I$. Finally, $D_I$ is defined as the set of
all the component denominations extracted from $I$.
3.1.2 Synonymy, Hyponymy and Hypernymy
Synonymy is usually defined as the relation between different
lexemes with the same meaning, leaving open the question of what
it means to have the same meaning. If this definition were
applied to any context, few words would be true synonyms.
Table 1 Patent US 5.328.488 “Laser light irradiation apparatus for
medical treatment”: excerpt from the list of components and their
alternative denominations
Component | Ref. character | Alternative denominations
laser light transmissive probe | 1 | laser light transmissive probe; probe; right side laser light transmissive probe; opposite laser light transmissive probe; laser light penetrating probe; transmissive probe; light transmissive probe; penetrating probe
optical fiber | 8 | optical fiber; single optical fiber holder;
particle | 20 | particle; laser light scattering particle; scattering particle
laser light emitting portion | 54a | laser light emitting portion; flat emitting portion
According to [20], the notion of substitutability has been
adopted: two lexemes will be considered synonyms if they
can be substituted for one another in a sentence without
changing either the meaning or the acceptability of the
sentence.
So, for the same purpose, it is assumed that two nominal
syntagms (either formed by a single word or by several
words) are synonyms if they are substitutable in some
environment. The environment will be that of an invention set
disclosed in a given patent corpus $I$. Hyponymy is the relation
between two lexemes that holds when one lexeme denotes
a subclass of the other. $A$ is said to be a hyponym of $B$ if
$B$ denotes a more general class; in this case $B$ is said to be
a hypernym of $A$. Thus, car is a hyponym of vehicle and vehicle
is a hypernym of car [20]. Hereafter, this kind of relation
will be considered as a generalization relation. Hence a
thesaurus built according to the algorithm described below will
always refer to a single patent corpus $I$ and will be composed
of synsets, every synset being a set of nominal syntagms
representing component denominations considered synonyms in
the context of the said invention set. A synset can be thought
of as a single concept described by the denominations it
contains and representing a common meaning or sense for those
denominations.
3.1.3 Co-occurrence Graph
To represent the information about component denominations
conveyed by alternative denominations sets, it is proposed to
use an undirected weighted graph defined as follows: the
co-occurrence graph $G_I = (V, E)$, where
$V = \{d_1, d_2, \ldots, d_n\}$ is the set of all the component
denominations $D_I$ defined above.
Let $A_I$ be the set of all the alternative denominations sets
found in the invention set $I$ and let $A_{k,p} \in A_I$.
An edge $e \in E$ between node $d_i$ and node $d_j$ exists if
$\exists A_{k,p} \in A_I : d_i, d_j \in A_{k,p}$.
Let $W$ be the weight function on $G_I$,
$W : E \to \mathbb{N} \times \mathbb{N}$:
the weight of the edge $e \in E$, $W(e) = (w_1, w_2)$, is
calculated such that $w_1$ represents the number of alternative
denominations sets in which $d_i$ and $d_j$ co-occur (in one or
more patents of the corpus), and $w_2$ represents the number of
different patents in which this happens.
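
The definition above can be sketched in a few lines. The data layout (a dictionary keyed by component and patent) and the name `cooccurrence_graph` are assumptions made for this illustration, and the toy denominations are invented; they do not reproduce the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(alt_sets):
    """alt_sets maps (component k, patent p) to its denominations set A_{k,p}.

    Returns {frozenset({d_i, d_j}): (w1, w2)}, where w1 counts the
    alternative denominations sets containing both denominations and
    w2 counts the distinct patents in which that happens.
    """
    set_count = defaultdict(int)
    patents = defaultdict(set)
    for (_component, patent), denominations in alt_sets.items():
        for d_i, d_j in combinations(sorted(denominations), 2):
            edge = frozenset((d_i, d_j))
            set_count[edge] += 1
            patents[edge].add(patent)
    return {edge: (set_count[edge], len(patents[edge])) for edge in set_count}

# Toy data, invented for illustration
alt_sets = {
    ("12", "P1"): {"coil", "solenoid coil", "solenoid"},
    ("7", "P1"): {"core", "movable core"},
    ("3", "P2"): {"coil", "solenoid"},
}
graph = cooccurrence_graph(alt_sets)
```

Here "coil" and "solenoid" co-occur in two denominations sets taken from two different patents, so their edge carries the weight pair (2, 2).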
3.1.4 Component Denominations Thesaurus
According to the definition adopted for the present work
mentioned in Sect. 2.3, the thesaurus can be represented
as a directed graph, in which nodes represent synsets and
directed edges represent generalization relations. As shown
below, not only the internal representation of the thesaurus
is a graph, but it is possible also to represent it graphically
in the user interface, allowing for a clear representation of
the conveyed information. The user is also able to interact
with the graph to modify it, in order to correct errors of the
algorithm or to add or modify relationships that could not be
discovered by the system automatically.
In order to build such a thesaurus the following algorithm
is applied.
1. Co-occurrence graph construction
Component denominations and alternative denominations
sets are extracted from the entire patent corpus $I$ and a
co-occurrence graph is built according to the definitions
provided above.
2. Generalization edges transformation
The first step to build the final thesaurus graph consists
in the transformation of co-occurrence edges into
generalization edges. If two co-occurring denominations are
in a generalization relation the corresponding edge is
transformed into a generalization edge. To identify
generalizations a simple heuristic is proposed: if component
denomination $d_i$ co-occurs with denomination $d_j$ (there is
an edge in the co-occurrence graph) and if $d_i \subset d_j$
(the words of $d_i$ are contained in $d_j$), then $d_i$
is considered to be a hypernym of $d_j$.
3. Synsets merging
Once every generalization relationship has been transformed,
co-occurring denominations are merged into synset
nodes. If a merge operation leads to inconsistency (a cycle
would be created in the graph) the nodes are not merged
and the co-occurrence edge is left for the user to
disambiguate. To reduce the number of inconsistencies, the
merging algorithm merges edges in an ordered way. Edges
are ordered by increasing number of words of the connected
nodes and by decreasing edge weight. The merging
operation starts from the smallest denominations and the
highest weights.
4. User disambiguation
In this step the user disambiguates the remaining
co-occurrence edges, corrects wrong relations and, where
needed, merges or separates synsets according to his/her
specific knowledge of the field.
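
Steps 2 and 3 can be sketched as follows. Two points are assumptions of this sketch rather than statements of the paper: the containment test is read as "the words of one denomination are a proper subset of the words of the other", and the consistency check is simplified to refusing a merge whenever two members of the would-be synset are linked by a generalization edge (a full implementation would detect any cycle). The function names and the toy weights are hypothetical.

```python
def is_hypernym(d_i, d_j):
    """Assumed heuristic: d_i generalizes d_j if its words are a proper subset."""
    return set(d_i.split()) < set(d_j.split())

def build_thesaurus(graph):
    """graph: {frozenset({d_i, d_j}): (w1, w2)}.

    Returns (synset_of, general, pending): the synset of each denomination,
    the (hypernym, hyponym) pairs, and the edges left to the user.
    """
    synset_of = {d: frozenset([d]) for edge in graph for d in edge}
    general, cooccurring = set(), []
    for edge, weight in graph.items():              # step 2
        a, b = sorted(edge)
        if is_hypernym(a, b):
            general.add((a, b))
        elif is_hypernym(b, a):
            general.add((b, a))
        else:
            cooccurring.append((a, b, weight))
    # step 3: fewest words first, then heaviest edges first
    cooccurring.sort(
        key=lambda e: (len(e[0].split()) + len(e[1].split()), -e[2][0], -e[2][1])
    )
    pending = []
    for a, b, _weight in cooccurring:
        merged = synset_of[a] | synset_of[b]
        if any(h in merged and s in merged for h, s in general):
            pending.append((a, b))                  # left for step 4
            continue
        for d in merged:
            synset_of[d] = merged
    return synset_of, general, pending

# Toy graph with invented weights
graph = {
    frozenset({"coil", "solenoid"}): (4, 4),
    frozenset({"coil", "solenoid coil"}): (4, 4),
    frozenset({"solenoid", "solenoid coil"}): (4, 4),
    frozenset({"solenoid coil", "electromagnetic solenoid"}): (2, 2),
    frozenset({"core", "movable core"}): (1, 1),
    frozenset({"core", "coil"}): (1, 1),
}
synset_of, general, pending = build_thesaurus(graph)
```

With these toy weights and no acceptance threshold, "core" ends up in the same synset as "coil" and "solenoid" through a single (1, 1) edge, which is the kind of weak merge that step 4 lets the user undo.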
An exemplary excerpt from a thesaurus construction task
is shown in Figs. 1–6. The first (Fig. 1) represents the
co-occurrence graph (step 1). In Fig. 2 edges representing
generalizations have been transformed (step 2); notice the
arrow that points to the hyponym.
In Fig. 3 single-word co-occurrences have been merged
(step 3). It is worth noting that in this trivial example,
constituted by a small number of patents and a subset of
components, no threshold has been used to merge the nodes;
in a general case, synsets can be created by merging only
the nodes whose link exceeds a minimum number of alternative
denominations sets in which the nodes co-occur and/or a
minimum number of patents in which the co-occurrence is
found.
In Fig. 4 two-word co-occurrences have been merged.
Notice that the edge A cannot be merged since there is
already a generalization edge connecting the synsets. In
Fig. 5 the edge A between {coil, solenoid, core} and
{solenoid coil, electromagnetic solenoid} has been
disambiguated: the expert considers solenoid, solenoid coil,
coil and electromagnetic solenoid as synonyms within this
invention set.
Fig. 1 Edges represent co-occurrence of denominations in
alternative denominations sets; notice the weight pair on each
edge: the left number represents the number of alternative
denominations sets in which the nodes co-occur, the number on
the right represents the number of patents in which the
co-occurrence happens (nodes: arc quenching core, core, movable
core, coil, solenoid, solenoid coil, electromagnetic solenoid)
Fig. 2 Generalizations identification. Generalization relations
have been identified and transformed
Fig. 3 Denominations merging step 1. Nodes composed of one word
have been merged; notice that the edge between (coil, solenoid,
core) and (solenoid coil) cannot be merged since there is
already a generalization relationship
Fig. 4 Denominations merging step 2. Nodes (solenoid coil) and
(electromagnetic solenoid) are merged. The user should
disambiguate the remaining co-occurrence edge
Fig. 5 Disambiguation: the user has chosen to disambiguate the
edge by eliminating the generalization relation; this leads to
the merging of the two nodes
Fig. 6 The final thesaurus. Notice that the user has chosen to
manually separate core from the synset (core, coil, solenoid
coil, electromagnetic solenoid)
In Fig. 6 the final thesaurus is represented. Notice that the
expert has chosen to separate {core} from the synset {coil,
solenoid coil, solenoid, electromagnetic solenoid} since it
cannot be considered a synonym even within this invention set.
Notice also that this error could have been easily avoided by
choosing a higher threshold for merging: the edge A connecting
{core} to the other denominations had a weight of (1,1), which
means that these denominations were co-occurring only in one
patent.
3.2 Thesauri Comparison and Field Parameters Identification
The algorithm described in the previous section allows
building a thesaurus related to a given corpus of patents. It is
evident that, due to the adopted criteria, the robustness of the
process increases with the uniformity of the corpus contents.
In other terms, the reliability of the thesaurus is higher if the
analysis is limited to documents belonging to the same class,
and even more so if it is focused on a specific subclass or even
a patent group.
Moreover, it is interesting to observe that in most cases the
IPC classification, especially for well-established products
and processes, even if not purposefully, is structured according
to a Function-Behavior-Structure hierarchy, such that top
level classes distinguish different functions or sets of
functions within a given domain, while deeper branches such as
groups and subgroups are more related to alternative behaviors
and structures to deliver the same function. For example, the
class D06F covers domestic or laundry devices for washing,
rinsing and dry-cleaning textile articles (Function). Within
this class the group D06F 23/00 is related to “Washing
machines with receptacles, e.g. perforated, having a rotary
movement, e.g. oscillatory movement, the receptacle serving
both for washing and centrifugally draining” (Behavior).
The subgroups D06F 23/02, D06F 23/04 and D06F 23/06
distinguish between “rotating or oscillating about a
horizontal/vertical/inclined axis” respectively (Structure).
The main idea of the present work for the identification of
domain parameters, as defined in Sect. 3, is that a comparison
between thesauri extracted from specific IPC subgroups
belonging to the same patent class should highlight common
terms mostly related to the main function of the technical
system. Moreover, it is assumed that the most characterizing
differences between thesauri extracted from complementary
IPC subgroups are related to the way the function is delivered,
i.e. to the behavior and the structure of the related
inventions (Fig. 7). By analyzing these attributes in each
thesaurus, and more specifically the hyponymy-hypernymy
chains in order to extract adjectives and appositions from the
hyponyms, it is possible to provide the user with a list of
terms closely connected to the features governing the
functioning of the system. In fact, as will be shown in the
following section, it is easy to extract from this list of terms
a set of relevant domain parameters.
At the current level of development, this last step is still
left to the user; nevertheless, the speed of the process
makes this task much more efficient than a traditional manual
investigation of the relevant design parameters.
Since the thesauri extracted according to the algorithm
described in Sect. 3.1 can be constituted by hundreds if not
thousands of entries, it is suggested to prioritize the analysis
by taking into account the terms containing the keywords
belonging to the analyzed IPC classes; then, by browsing
the thesaurus network through the hypernym/hyponym links,
it is possible to build a list of adjectives and appositions
from which the user can easily extract relevant technical
parameters of the technical field under study.
Fig. 7 Comparison of thesauri extracted from IPC groups and
subgroups of a same patent class
At the present level of the research, no robust criteria
have been identified to distinguish, within the domain
parameters set, between Evaluation and Control Parameters;
therefore the classification is left to the patent analyst.
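
The comparison idea of this section can be illustrated with plain set operations. This is only a sketch under simplifying assumptions: each thesaurus is reduced to a flat set of terms (synsets and generalization links are ignored), the function name `compare_thesauri` is hypothetical, and the terms in the toy data are invented rather than extracted from real patents.

```python
def compare_thesauri(thesauri):
    """thesauri maps an IPC subgroup label to the set of terms in its thesaurus.

    Returns (common, distinctive): the terms shared by all subgroups, which
    are expected to relate to the main function, and, per subgroup, the
    terms found in no other subgroup, which are expected to relate to
    behavior and structure.
    """
    common = set.intersection(*thesauri.values())
    distinctive = {}
    for label, terms in thesauri.items():
        others = set().union(
            *(t for other, t in thesauri.items() if other != label)
        )
        distinctive[label] = terms - others
    return common, distinctive

# Invented toy terms, for illustration only
thesauri = {
    "C02F-1/02": {"reservoir", "water", "heater"},
    "C02F-1/28": {"reservoir", "water", "sorbent"},
}
common, distinctive = compare_thesauri(thesauri)
```

A real comparison would work on synsets rather than on raw strings, so that synonyms used in different subgroups are recognized as the same concept.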
4 Exemplary Application: Technologies for Water Purification
In order to clarify the proposed comparison procedure, this
section describes an exemplary application in the field of
water purification through different technologies.
The most relevant International Patent Class related to
this function is C02F (Treatment of water, waste water,
sewage, or sludge), which is subdivided into:
• C02F-1 (Treatment of water, waste water, or sewage);
• C02F-3 (Biological treatment of water, waste water, or
sewage);
• C02F-5 (Softening water; Preventing scale; Adding scale
preventatives or scale removers to water, e.g. adding
sequestering agents);
• C02F-7 (Aeration of stretches of water);
• C02F-9 (Multistep treatment of water, waste water or
sewage).
Fig. 8 Excerpt from the thesaurus graph automatically built by
processing 150 patents belonging to the class C02F-1/02. Values
shown in the synset boxes:
Synset | A | B | avgC | maxC | avgD | maxD
acid | 1 | 1 | 1.0 | 1 | 1.0 | 1
acid reservoir | 1 | 1 | 1.0 | 1 | 1.0 | 1
reservoir tank | 1 | 1 | 1.0 | 1 | 1.0 | 1
distilled water reservoir | 2 | 2 | 2.0 | 2 | 2.0 | 2
filtrate reservoir | 2 | 1 | 2.0 | 2 | 1.0 | 1
fog-laden water reservoir | 1 | 1 | 1.0 | 1 | 1.0 | 1
gray water reservoir | 2 | 1 | 2.0 | 2 | 1.0 | 1
tanks | 94 | 46 | 94.0 | 94 | 46.0 | 46
reservoir | 19 | 14 | 19.0 | 19 | 14.0 | 14
As mentioned in Sect. 2.1, each of these classes is further
subdivided into full digit classes, related to alternative
specific technologies (behaviors) to deliver the main function.
For example, the treatment of water (C02F-1) can be
operated by:
• Heating (C02F-1/02);
• Freezing (C02F-1/22);
• Flotation (C02F-1/24);
• Sorption (C02F-1/28);
• Irradiation (C02F-1/30);
• Centrifugal separation (C02F-1/38);
• and others...
A thesaurus has been automatically built for each of these
classes through the steps described in Sect. 3.1, by analyzing
all the patents granted by the United States Patent Office
between 1971 and August 2009 and by the European Patent
Office between 1980 and August 2009.
In this case study, the threshold level for automatic
acceptance of the semantic relationships (synonymy,
hypernymy/hyponymy) has been set to (3, 2), according to the
weight definition given in Sect. 3.1. It means that only
the semantic links appearing in at least 3 different components
and in at least 2 different patents have been stored
in the thesaurus. In order to demonstrate the efficiency of
the proposed algorithms, neither manual disambiguation
nor manual integration of semantic relationships has been
applied. It is clear that a thesaurus improved through the
contribution of a subject meta-expert would provide a richer
set of information.
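
The (3, 2) acceptance threshold amounts to a filter on the weight pair of each link. A minimal sketch follows; the function name `accepted_edges`, the edge representation and the toy weights are assumptions made for illustration.

```python
def accepted_edges(graph, threshold=(3, 2)):
    """Keep only the links whose weight pair (w1, w2) meets both minima."""
    min_sets, min_patents = threshold
    return {
        edge: (w1, w2)
        for edge, (w1, w2) in graph.items()
        if w1 >= min_sets and w2 >= min_patents
    }

# Toy weights, invented for illustration
graph = {
    frozenset({"coil", "solenoid"}): (4, 4),
    frozenset({"core", "coil"}): (1, 1),
    frozenset({"tank", "reservoir"}): (3, 1),
}
kept = accepted_edges(graph)
```

With the default threshold only the first link survives: the second fails both minima and the third appears in a single patent.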
An exemplary excerpt from the thesaurus graph related to the class C02F-1/02 (Treatment of water by Heating) is shown in Fig. 8. In this example, each synset consists of just one syntagm (single or multiword). The parameters shown in the synset boxes are:
A = number of components where the synset occurs;
B = number of patents where the synset occurs;
C = number of components where each alternative denomination of the synset occurs (average and max value);
D = number of patents where each alternative denomination of the synset occurs (average and max value).
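As a rough illustration of how the four parameters relate to each other, the sketch below computes them for one synset from per-denomination occurrence lists. The data layout and the function name `synset_stats` are hypothetical. It assumes that A and B count the union of occurrences over all denominations, while C and D are computed per denomination; this reading is consistent with multi-denomination synsets whose A is smaller than the sum of the per-denomination counts.

```python
def synset_stats(occurrences):
    """Compute A, B, avgC, maxC, avgD, maxD for one synset.

    occurrences maps each denomination of the synset to a list of
    (patent_id, component_id) pairs where it was found (assumed layout).
    """
    all_components = set()
    all_patents = set()
    component_counts = []
    patent_counts = []
    for pairs in occurrences.values():
        components = set(pairs)
        patents = {patent_id for patent_id, _ in pairs}
        all_components |= components
        all_patents |= patents
        component_counts.append(len(components))
        patent_counts.append(len(patents))
    n = len(occurrences)
    return {
        "A": len(all_components),
        "B": len(all_patents),
        "avgC": sum(component_counts) / n,
        "maxC": max(component_counts),
        "avgD": sum(patent_counts) / n,
        "maxD": max(patent_counts),
    }

# Illustrative synset with two denominations sharing one component.
stats = synset_stats({
    "filter material": [("p1", "c1"), ("p1", "c2"), ("p2", "c1")],
    "filter media": [("p2", "c1"), ("p3", "c1")],
})
```

For a synset with a single denomination, A equals avgC and maxC, and B equals avgD and maxD, which matches most boxes in Fig. 8.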
As stated in the previous section, the hyponyms related to a given term are characterized by attributes that can be associated with parameters which qualify the given term. From the example in Fig. 8, the patent analyst can deduce with no effort (i.e. without reading any patent document) that reservoirs for water treatment by heating can be classified according to their Control Parameter “content”, which can assume the following values:
acid;
distilled water;
filtrate;
fog-laden;
gray water.
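The step from hyponyms to candidate attribute values can be sketched as follows: the modifiers preceding the head noun are taken as candidate values. This is a simplified illustration under the assumption that the head noun is the last word of the hyponym; the function name `attribute_values` is hypothetical, and the interpretation of the values (e.g. reducing “fog-laden water” to “fog-laden”) remains with the analyst.

```python
def attribute_values(head, hyponyms):
    """Return the modifiers that qualify `head` in each multiword hyponym."""
    values = []
    for term in hyponyms:
        words = term.split()
        # Keep only hyponyms of the form "<modifiers> <head>".
        if len(words) > 1 and words[-1] == head:
            values.append(" ".join(words[:-1]))
    return values

# Hyponyms of "reservoir" visible in the excerpt of Fig. 8.
hyponyms = [
    "distilled water reservoir",
    "filtrate reservoir",
    "fog-laden water reservoir",
    "gray water reservoir",
]

values = attribute_values("reservoir", hyponyms)
```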
It must be observed, however, that the attributes related to the same noun are not necessarily different values of the same parameter. For example, in the class C02F-1/28 the noun “reservoir” has the attribute “solvent”, which is in fact a possible value of the parameter “content”, but also the attribute “feed”, which can be interpreted as a value of the control parameter “function”. Therefore, the interpretation of the parameters must still be done by the patent analyst, as well as the association of the attributes as possible values of each parameter.
Besides, the authors are investigating the possibility of increasing the level of automation of the algorithm by connecting the analysis to a general purpose thesaurus, as proposed also in [21]. For example, “acid”, “water” and “filtrate” share the same direct and indirect hypernyms: “chemical, chemical substance”, “material, stuff”, “substance”, “matter”. Thus, the analysis of hypernymy chains might help to distinguish between values related to different parameters
filter: A=28, B=21, avgC=28.0, maxC=28, avgD=21.0, maxD=21
activated carbon filter layer: A=1, B=1
unactivated carbon filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
activated carbon filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
sterilizing filter: A=7, B=7, avgC=7.0, maxC=7, avgD=7.0, maxD=7
pressure-sensitive filter: A=1, B=1
cartridge filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
filter/settling grid; lining; volatile grid: A=5, B=3, maxC=2, maxD=2
particulate filter: A=4, B=4, C=4, D=4
leukocyte reduction filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
aluminum filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
bubble-removing filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
hydrocarbon absorption filter; hydrocarbon adsorption filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
biofilter system: A=1, B=1, avgC=1, avgD=1
filter element; filter material; filter media: A=8, B=6, avgC=4.3333335, maxC=7, avgD=3.6666667, maxD=5
activated aluminum filter: A=1, B=1, avgC=2.0, maxC=2, avgD=1.0, maxD=2
ceramic filter material: A=4, B=4, avgC=4.0, maxC=4, avgD=4.0, maxD=4
backflushable biofilter system: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
Fig. 9 Excerpt from the thesaurus graph automatically built by processing 341 patents belonging to the class C02F-1/24
and possibly also to identify the categories of the parameters
themselves.
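The idea of using hypernymy chains to separate values of different parameters can be sketched with a toy lookup table. The chains for “acid”, “water” and “filtrate” follow the hypernyms quoted in the text; the chain for “feed” is a made-up placeholder, and the dictionary stands in for a real general purpose thesaurus. Function and variable names are hypothetical.

```python
# Toy hypernym chains. Only the first three follow the text;
# the chain for "feed" is a hypothetical placeholder.
HYPERNYMS = {
    "acid": ["chemical", "material", "substance", "matter"],
    "water": ["chemical", "material", "substance", "matter"],
    "filtrate": ["chemical", "material", "substance", "matter"],
    "feed": ["supply", "activity"],
}

def group_by_shared_hypernyms(values, chains):
    """Greedily group values that share at least one hypernym."""
    groups = []
    for value in values:
        for group in groups:
            # Compare against the first member of each existing group.
            if set(chains[value]) & set(chains[group[0]]):
                group.append(value)
                break
        else:
            groups.append([value])
    return groups

groups = group_by_shared_hypernyms(["acid", "water", "filtrate", "feed"], HYPERNYMS)
```

In this toy case the first group collects candidate values of a “content” parameter, while “feed” is kept apart as a candidate value of a different parameter. With real thesaurus data, very generic hypernyms shared by almost every noun would probably need to be excluded before grouping.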
As claimed in the previous section, it is interesting to compare the attributes assigned to the same item in alternative technical systems, i.e. the hyponym sets extracted from complementary patent classes. In fact, the differences help to reveal peculiarities and can be proposed to the patent analyst as triggers for identifying the most characteristic technical parameters.
For example, consider Figs. 9 and 10, representing the direct hyponym sets of the item “filter” in the classes C02F-1/24 and C02F-1/28 (water treatment by flotation and by sorption). A technician, even without reading any patent from those classes, can identify with minimal effort the parameters and values reported in Tables 2 and 3.
Not surprisingly, flotation systems explicitly cover a wider range of applications, as revealed by the evaluation parameters related to the object of the filtering action. Moreover, several action principles have been identified.
Conversely, sorption-based systems are characterized by different geometries and properties related to their operating conditions.
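The comparison between complementary classes can be sketched as plain set operations on the attributes of the same item. The attribute sets below are a subset of the modifiers of “filter” read from Figs. 9 and 10; the function name `compare_attributes` and the dictionary keys are hypothetical.

```python
def compare_attributes(class_a, class_b):
    """Split attributes into shared ones and those peculiar to each class."""
    return {
        "shared": class_a & class_b,
        "only_a": class_a - class_b,
        "only_b": class_b - class_a,
    }

# Modifiers of "filter" in C02F-1/24 (flotation), from Fig. 9.
flotation = {
    "activated carbon", "unactivated carbon", "sterilizing",
    "particulate", "leukocyte reduction", "aluminum",
    "bubble-removing", "cartridge", "pressure-sensitive",
}
# Modifiers of "filter" in C02F-1/28 (sorption), from Fig. 10.
sorption = {
    "activated carbon", "carbon", "carbon block", "sand", "sediment",
    "cylindrical", "hollow cylindrical", "immersible", "replaceable",
}

result = compare_attributes(flotation, sorption)
```

The attributes found in only one class are the ones that would be shown to the analyst as triggers, e.g. the shape-related modifiers that appear only for sorption.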
By navigating the hypernym/hyponym links starting from the keywords extracted from the titles of the IPC classes/subclasses, it is possible to collect a comprehensive set of parameters and values to support the construction of a model of the domain under analysis.
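The navigation of hyponym links from a starting keyword can be sketched as a breadth-first traversal. The graph below is a hypothetical fragment: the node names come from Fig. 9, but the edges are assumed from the wording of the terms and are not taken from the thesaurus itself.

```python
from collections import deque

def collect_hyponyms(graph, start):
    """Return all direct and indirect hyponyms of `start`, nearest first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        term = queue.popleft()
        for child in graph.get(term, []):
            if child not in seen:  # guard against cycles and repeated terms
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

# Hypothetical hyponym links (hypernym -> list of hyponyms).
graph = {
    "filter": ["biofilter system", "activated carbon filter"],
    "biofilter system": ["backflushable biofilter system"],
    "activated carbon filter": ["activated carbon filter layer"],
}

terms = collect_hyponyms(graph, "filter")
```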
activated carbon filter: A=2, B=2, avgC=2.0, maxC=2, avgD=2.0, maxD=2
hollow cylindrical filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
carbon filter: A=2, B=2, avgC=2.0, maxC=2, avgD=2.0, maxD=2
carbon block filter: A=2, B=2, avgC=2.0, maxC=2, avgD=2.0, maxD=2
sediment filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
immersible filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
replaceable torroidal shaped filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
carbon block: A=2, B=2, avgC=2.0, maxC=2, avgD=2.0, maxD=2
water filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
sand filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
replaceable filter: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
filter part: A=1, B=1, avgC=1.0, maxC=1, avgD=1.0, maxD=1
cylindrical filter: A=2, B=2, avgC=2.0, maxC=2, avgD=2.0, maxD=2
filter: A=25, B=16, avgC=25.0, maxC=25, avgD=16.0, maxD=16
Fig. 10 Excerpt from the thesaurus graph automatically built by processing 150 patents belonging to the class C02F-1/28
Table 2 Exemplary parameters for the item “filter” and related values which can be manually extracted from the hyponym set of Fig. 9

Parameter                           Values
Action principle/Filter material    Activated carbon (filter)
Action principle/Filter material    Unactivated carbon (filter)
Action principle/Filter material    Aluminum (filter)
Action principle/Filter material    Activated aluminum (filter)
Action principle/Filter material    Ceramic (filter)
Action principle/Filter material    Biofilter
Action principle/Filter material    Absorption/adsorption
Object of filter action             Particulate
Object of filter action             Leukocyte
Object of filter action             Bacteria (Sterilizing)
Object of filter action             Bubble
Object of filter action             Hydrocarbon
Maintenance/Cleanability            Backflushable

Table 3 Exemplary parameters for the item “filter” and related values which can be manually extracted from the hyponym set of Fig. 10

Parameter                           Values
Action principle/Filter material    Carbon (filter)
Action principle/Filter material    Activated carbon (filter)
Action principle/Filter material    Sand (filter)
Shape                               Cylindrical
Shape                               Hollow cylindrical
Shape                               Toroidal
Working Environment                 Immersible
Maintenance/Replaceability          Replaceable
5 Conclusions and Further Developments
The present paper addresses the goal of reducing the time and effort necessary to gather domain information from patent analysis. The specific objective is to speed up the identification of the domain technical parameters (Evaluation and Control Parameters) relevant for a given field of application. These sets of parameters can be used for creating a general purpose domain Knowledge Base, for mapping the key problems to be addressed in a given field of application [2], or for supporting evolutionary analyses of technical systems [3].
The authors, on the basis of their past experience in the field of patent text mining, are studying the possibility of identifying domain technical parameters through the comparison of the thesauri extracted from complementary patent classes. At the current stage of development, the proposed approach provides the patent analyst with a set of attributes for each relevant element of the technical system, from which the extraction of evaluation and control parameters is a quite efficient task, in any case much faster than any approach based on questionnaires to experts or manual patent reading. Nevertheless, the applications performed so far do not allow an estimate of the completeness of the domain coverage: it is assumed that the attributes and qualifications reported in the patents belonging to a certain technical field cover all the domain parameters.
The identification of the technical parameters can be further automated by exploiting available information such as the semantic relationships in general purpose thesauri or the location in the patent text (e.g. parameters extracted from the claims are essentially related to design choices, i.e. control parameters). Besides, the first attempts to identify also the relationships between the parameters have revealed that further analyses are needed to recognize general patterns to be formalized in terms of algorithmic rules.
The proposed semi-automatic approach to build a thesaurus for a specific patent class and to extract relevant domain parameters has been clarified by means of an example in the field of water treatment, where six alternative technologies have been analyzed. The promising results obtained so far suggest investigating, with further case studies, the validity of the proposed algorithms and the opportunities for further development.
Acknowledgments The authors would like to thank Niccolò Becattini
and Walter D’Anna from Politecnico di Milano for their contribution to
patents search and analysis.
References
1. Porter, A.L., et al. (2004) Technology futures analysis: Toward integration of the field and new methods. Technological Forecasting & Social Change, 71:287–303.
2. Cavallucci, D., Eltzer, T. (2007) Parameter network as a means for driving problem solving process. International Journal of Computer Applications in Technology, 30(1/2):125–136.
3. Cascini, G., Rotini, F., Russo, D. (2009) Functional mod-
eling for TRIZ-based evolutionary analyses. Proceedings of
the International Conference on Engineering Design, ICED09,
Stanford University, Stanford, CA, USA, 24-27 August.
4. Bregonje, M. (2005) Patents: A unique source for scientific tech-
nical information in chemistry related industry? World Patent
Information, 27(4):309–315.
5. Guide to the IPC (version 2009) Available at http://www.wipo.
int/classifications/ipc/en/general/.
6. Cascini, G., Neri, F. (2004) Natural language processing for
patents analysis and classification. Proceedings of the 4th TRIZ
Future Conference, Florence, 3–5 November, published by Firenze
University Press, ISBN 88-8453-221-3.
7. Cascini, G., Russo, D., Zini, M. (2007) Computer-aided patent
analysis: Finding invention peculiarities. Proceedings of the
2nd IFIP Working Conference on Computer Aided Innovation,
Brighton, MI, USA, 8-9 October, published on “Trends in
Computer-Aided Innovation”, Springer, ISBN 978-0-387-75455-
0, pp. 167–178.
8. Cascini, G., Russo, D. (2007) Computer-aided analysis of patents
and search for TRIZ contradictions. International Journal of
Product Development, Special Issue: Creativity and Innovation
Employing TRIZ, 4:1–2.
9. Shapiro, S. (1992) Encyclopedia of Artificial Intelligence, vol. 2.
Wiley, New York, NY.
10. Aitchison, J., Gilchrist, A., Bawden, D. (2000) Thesaurus
Construction and Use: a Practical Manual. ASLIB, London.
11. Schneider, J.W. (2005) Verification of bibliometric methods’
applicability for thesaurus construction. SIGIR Forum, 39(1):
63–64.
12. Schutze, H., Pedersen, J. (1997) A co-occurrence-based the-
saurus and two applications to information retrieval. Information
Processing and Management, 33(3):307–318.
13. Miller, U. (1997) Thesaurus construction: Problems and their
roots. Information Processing and Management, 33(4):481–493.
14. Curran, J., Moens, M. (2002) Improvements in automatic thesaurus extraction. Unsupervised Lexical Acquisition: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), Philadelphia, July 2002, pp. 59–66. Association for Computational Linguistics. Paper available at: http://www.aclweb.org/anthology/W/W02/W02-0908.pdf.
15. Curran, J. (2002) Ensemble methods for automatic thesaurus extraction. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP ’02), Vol. 10, pp. 222–229, doi: 10.3115/1118693.1118722.
16. Hearst, M. (1992) Automatic acquisition of hyponyms from
large text corpora. Proceedings of the 14th Conference
on Computational linguistics, vol. 2, Association for
Computational Linguistics, Morristown, NJ, USA, pp. 539–545.
17. Caraballo, S. (1999) Automatic construction of a hypernym-labeled noun hierarchy from text. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, Morristown, NJ, USA, pp. 120–126.
18. Shinzato, K., Torisawa, K. (2004) Acquiring hyponymy rela-
tions from web documents. Proceedings of the Human
Language Technology Conference of the North American
Chapter of the Association for Computational Linguistics:
HLT-NAACL 2004. Available at: http://acl.ldc.upenn.edu/hlt-
naacl2004/main/pdf/103_Paper.pdf
19. Consolidated Patent Rules: Title 37 Code of Federal
Regulations Patents, Trademarks, and Copyrights. Available at
http://www.uspto.gov/web/offices/pac/mpep/consolidated_rules.pdf
(last access October 2009).
20. Jurafsky, D., Martin, J.H. (2000) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall Series in Artificial Intelligence. Prentice Hall, Upper Saddle River, NJ, ISBN 0-13-095069-6.
21. Lee, S., Huh, S.Y., McNiel, R.D. (2008) Automatic genera-
tion of concept hierarchies using WordNet. Expert Systems with
Applications, 35(3):1132–1144.