
Towards a Theory of Formal Classification

Authors:

Fausto Giunchiglia

Università degli Studi di Trento

Maurizio Marchese

Università degli Studi di Trento

Ilya Zaihrayeu

Università degli Studi di Trento

Classifications have been used for centuries with the goal of cataloguing and searching large sets of objects. In the early days it was mainly books; lately it has become Web pages, pictures and any kind of electronic information items. Classifications describe their contents using natural language labels, an approach which has proved very effective in manual classification. However natural language labels show their limitations when one tries to automate the process, as they make it almost impossible to reason about classifications and their contents. In this paper we introduce the novel notion of Formal Classification, as a graph structure where labels are written in a logical concept language. The main property of Formal Classifications is that each node can be associated to a normal form formula which univocally describes its contents. This in turn allows us to reduce document classification and query answering to fully automatic propositional reasoning.


UNIVERSITY

OF TRENTO

DEPARTMENT OF INFORMATION AND COMMUNICATION TECHNOLOGY

38050 Povo – Trento (Italy), Via Sommarive 14

http://www.dit.unitn.it

TOWARDS A THEORY

OF FORMAL CLASSIFICATION

Fausto Giunchiglia, Maurizio Marchese and Ilya Zaihrayeu

May 2005

Technical Report # DIT-05-048


{fausto, marchese, ilya}@dit.unitn.it

Department of Information and Communication Technology

University of Trento, Italy


1 Introduction

In today’s information society, as the amount of information grows larger, it

becomes essential to develop eﬃcient ways to summarize and navigate informa-

tion from large, multivariate data sets. The ﬁeld of classiﬁcation supports these

tasks, as it investigates how sets of “objects” can be summarized into a small

number of classes, and it also provides methods to assist the search of such “ob-

jects” [6]. In the past centuries, classiﬁcation has been the domain of librarians

and archivists. Lately, much interest has also focused on the management of the information present on the web: see, for instance, the WWW Virtual Library project¹, or the directories of search engines such as Google or Yahoo!.

Standard classification methodologies amount to manually organizing topics into hierarchies. Hierarchical library classification systems, such as the Dewey Decimal Classification System (DDC)² or the Library of Congress Classification system (LCC)³, are attempts to develop static, hierarchical classification structures into which all of human knowledge can be classified. Although these are standard and universal techniques, they have a number of limitations:

¹ The WWW Virtual Library project, see http://vlib.org/.

² The Dewey Decimal Classification system, see http://www.oclc.org/dewey/.

³ The Library of Congress Classification system, see http://www.loc.gov/catdir/cpso/lcco/lcco.html/.

– Neither classification nor search tasks scale to large amounts of information. This is because, among other things, at any given level in such a hierarchy there may be more than one choice of topic under which an object might be classified or searched.

– The semantics of a given topic is implicitly codified in a natural language label. These labels must therefore be interpreted and disambiguated.

– The semantic interpretation of a given topic also depends on the meanings associated with the labels at higher levels in the hierarchy [10].

In the present paper we propose a formal approach to classiﬁcation, capable

of capturing the implicit knowledge present in classiﬁcation hierarchies, and of

supporting automated reasoning to help humans in their classiﬁcation and search

tasks. To this end, we propose a two step approach:

– First, we convert a classification into a new structure, which we call a Formal Classification (FC), where all the labels are expressed in a Propositional Description Logic language⁴, which we call the Concept Language.

– Then, we further convert an FC into a Normalized Formal Classification (NFC). In NFCs, each node is associated with a Concept Language formula, which we call the concept at a node, and which univocally codifies the node contents, taking into account both the label of the node and its position within the classification.

NFCs and concepts at nodes have many nice properties. Among them:

– They can be expressed in Conjunctive and/or Disjunctive Normal Forms (CNF / DNF). This allows humans and machines to easily inspect and reason about classifications (both visually and computationally).

– Document classification and query answering can be done simply by exploiting the univocally defined semantics codified in concepts at nodes. There is no need to inspect the edge structure of the classification.

– Concepts at nodes are organized in a taxonomic structure where, from the root down to the leaves of the classification, child nodes are subsumed by their parent nodes.

The remainder of the paper is organized as follows. In Section 2 we introduce

and present examples of standard classiﬁcations. In Section 3 we introduce the

deﬁnition of FC and discuss its properties. In Section 4 we introduce the notion

of NFC and its properties. In Section 5 we show how the two main operations performed on classifications, namely classification and search, can be fully automated in NFCs by reduction to a propositional satisfiability problem. A discussion of related and future work concludes the paper.

⁴ A Propositional Description Logic language is a Description Logic language [1] without roles.

2 Classiﬁcations

Classiﬁcations are hierarchical structures used to organize large amounts of ob-

jects [10]. These objects can be of many diﬀerent types, depending on the charac-

teristics and uses of the classiﬁcation itself. In a library, they are mainly books

or journals; in a ﬁle system, they can be any kind of ﬁle (e.g., text ﬁles, im-

ages, applications); in the directories of Web portals, the objects are pointers to

Web pages; in marketplaces, catalogs organize either product data or service titles. Classifications are useful for both object classification and retrieval. Users browse the hierarchies and quickly catalogue or access the objects associated with different concepts and linked to natural language labels. We define the notion of Classification as follows:

Definition 1 (Classification) A Classification is a rooted tree described by a triple $H = \langle C, E, l \rangle$, where $C$ is a finite set of nodes, $E$ is a set of edges on $C$, and $l$ is a function from $C$ to a set $L$ of labels expressed in a natural language.

In the rest of this section we describe and briefly discuss two different Classifications: a library classification hierarchy, the Dewey Decimal Classification system (DDC), and an example from a modern web catalogue, namely the Amazon book categories catalogue.

Example 1 (DDC). Since the 19th century, librarians have used DDC to organize large collections of books. DDC divides knowledge into ten broad subject areas, called classes, numbered 000 to 999. Materials that are too general to belong to a specific group (encyclopedias, newspapers, magazines, etc.) are placed in the 000s. The ten main classes are divided into smaller classes by several sets of subclasses. Smaller divisions (to subdivide the topic even further) are created by expanding each subclass and adding decimals if necessary. A small part of the DDC system is shown in Figure 1.

500 Natural Science and Mathematics
  520 Astronomy and allied sciences
    523 Specific celestial bodies and phenomena
      523.1 The universe
      523.2 Solar system
      523.3 The Earth
      523.4 The moon
      523.5 Planets
        523.51 Mercury
        523.52 Venus
        523.53 Mars → 523.53 HAN
        . . .

Fig. 1. A part of the DDC system with an example of book classiﬁcation
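The fragment in Figure 1 can be encoded directly as a Classification in the sense of Definition 1. The following is a minimal sketch, not the authors' implementation: the class name `Classification`, its methods, and the choice of Dewey numbers as node identifiers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Classification:
    """Rooted tree H = <C, E, l> (hypothetical encoding of Definition 1)."""
    labels: Dict[str, str]                                     # l: node -> natural language label
    edges: Set[Tuple[str, str]] = field(default_factory=set)   # E: (parent, child) pairs

    def parent(self, node: str) -> Optional[str]:
        """Return the parent of `node`, or None if `node` is the root."""
        for p, c in self.edges:
            if c == node:
                return p
        return None

    def path_from_root(self, node: str) -> List[str]:
        """Return the nodes on the path from the root down to `node`."""
        path = [node]
        p = self.parent(node)
        while p is not None:
            path.append(p)
            p = self.parent(p)
        return path[::-1]


# Part of Figure 1; node identifiers are the Dewey numbers themselves.
ddc = Classification(
    labels={
        "500": "Natural Science and Mathematics",
        "520": "Astronomy and allied sciences",
        "523": "Specific celestial bodies and phenomena",
        "523.5": "Planets",
        "523.53": "Mars",
    },
    edges={("500", "520"), ("520", "523"), ("523", "523.5"), ("523.5", "523.53")},
)

# Print the labels along the path that leads to the node where "The Real Mars" is shelved.
print(" -> ".join(ddc.labels[n] for n in ddc.path_from_root("523.53")))
```

Storing the edges as (parent, child) pairs keeps the sketch close to the triple in the definition; a production structure would more likely index children and parents directly.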

In DDC, the notation (i.e., the system of symbols used to represent the classes

in a classiﬁcation system) provides a universal language to identify the class and

related classes.

Before a book is placed on the shelves:

– it is classified according to the discipline it covers, which gives its Dewey number;

– some letters (usually three, typically taken from the author's last name) are added to this number;

– the resulting number is used to identify the book and to indicate where it will be shelved in the library. Books can be assigned a Dewey number corresponding to either a leaf or a non-leaf node of the classification hierarchy.

Since parts of DDC are arranged by discipline, not subject, a subject may

appear in more than one class. For example, the subject “clothing” has aspects

that fall under several disciplines. The psychological inﬂuence of clothing belongs

in 155.95 as part of the discipline of psychology; customs associated with clothing

belong in 391 as part of the discipline of customs; and clothing in the sense of

fashion design belongs in 746.92 as part of the discipline of the arts. However, the final Dewey number associated with a book is unique, so the classifier must make a classification choice.

As an example, consider how to determine the Dewey number for the following book: Michael Hanlon, “The Real Mars”. A possible classification is the Dewey number 523.53 HAN; this classification choice is shown in Figure 1.
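The shelving convention described above (a Dewey number followed by, usually, the first three letters of the author's last name) can be sketched as a small helper. The function name `call_number` and its parameters are hypothetical; real libraries may apply additional rules.

```python
def call_number(dewey: str, author_last_name: str, letters: int = 3) -> str:
    """Combine a Dewey number with the first letters of the author's last name."""
    return f"{dewey} {author_last_name[:letters].upper()}"


# Michael Hanlon, "The Real Mars", classified under 523.53 (Mars).
print(call_number("523.53", "Hanlon"))
```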

The main properties of DDC are:

– The classification algorithm relies on the “Get Specific” criterion⁵: when you add a new object, get as specific as possible: dig deep into the classification schema, looking for the appropriate sub-category; it is bad practice to submit an object to a top level category if a more specific one exists. At present, the enforcement of this criterion is left to the experience of the classifier.

– Each object is placed in exactly one place in the hierarchy. As a result of this restriction, a classifier often has to choose arbitrarily among several reasonable categories when assigning the classification code for a new document (see the above example for “clothing”). Despite the use of documents called “subject authorities”, which attempt to impose some control on terminology and classification criteria, there is no guarantee that two classifiers make the same decision. Thus, a user searching for information has to guess the classifier's choice to decide where to look, and will typically have to look in a number of places.

– Each non-root node in the hierarchy has only one parent node. This enforces a tree structure on the hierarchy.

⁵ See http://docs.yahoo.com/info/suggest/appropriate.html for how Yahoo! implements this rule.

Example 2 (Amazon book directory). Many search engines, such as Google and Yahoo!, as well as many eCommerce vendors, such as Amazon, offer mechanisms to search for relevant items. This is the case, for instance, of the web directory catalogue for books (among other items) used in Amazon. At present Amazon has 35 main subjects. Books are inserted by the classifier in the web directory, and users browse this classification hierarchy to access the books they are interested in.

In Amazon, as in DDC, books can be classified in both leaf and non-leaf nodes⁶, following the “Get Specific” criterion, but also the “Related Directory” criterion⁷, whereby the classifier browses through the hierarchy looking for an appropriate category that lists similar documents. In this classification hierarchy, a book can often be reached via different paths of the hierarchy, thus providing efficient ways to arrive at items of interest from different perspectives.

In the following we present an example of classification for a software programming book in the Amazon Book Web Directory. The book title is “Enterprise Java Beans, Fourth Edition”. In the current Amazon book directory⁸, the example title can be found through two different search paths (see Figure 2), namely:

Subjects → Business and Investing → Small Business and Entrepreneurship → New Business Enterprises

Subjects → Computers and Internet → Programming → Java Language → Java Beans

From this brief presentation and from the two specific examples we can see that Web catalogues are more open than classifications like Dewey. In fact, their aim is not to place a resource in a unique position, but rather to place it in such a way that a user who navigates the catalogue can easily find appropriate or similar resources related to a given topic.

3 Formal Classiﬁcations

Let us use the two examples above to present and discuss a number of charac-

teristics that are relevant to classiﬁcations and that need to be considered in a

formal theory of classiﬁcation.

⁶ Amazon implements this by assigning to non-leaf nodes a leaf node labeled “General”, where items related to the non-leaf nodes are classified.

⁷ See http://www.google.com/dirhelp.html#related for how Google implements this rule.

⁸ See http://www.amazon.com, April 2005.

Fig. 2. Amazon Book Directory

Let us start from the characteristics of edges. People consider classifications top down. Namely, when classifying or searching for a document, upper level nodes are considered first, and then, if these nodes are too general for the given criteria, lower level nodes may also be inspected. Child nodes in a classification are always considered in the context of their parent nodes, and therefore specialize the meaning of the parent nodes. In a classification there are two possible meaningful interrelationships between parent and child nodes, as shown in Figure 3:

Fig. 3. Edge semantics for formal classiﬁcations

– Case (a) represents edges expressing the “general intersection” relation: intuitively, the meaning of node 2 is area C, which is the intersection of areas A and B. For instance, in our Amazon example, the edge Computers and Internet → Programming in Figure 2 codifies all the items that are in common (see Figure 4) to the categories Computers and Internet (i.e., hardware, software, networking, etc.) and Programming (i.e., scheduling, planning, computer programming, web programming, etc.). Edges of this kind are also present in library systems, such as DDC, at lower levels of the hierarchy, where different facets of a particular parent category are considered.

Fig. 4. Example of general intersection

– Case (b) represents a more specific case where the child node is “subsumed by” the parent node. In this case the meaning of node 2 is area B. This kind of edge is also called an “is-a” edge. Note that in this case, differently from case (a), node A does not influence what is classified in node B.

Many edges in DDC impose the “is-a” relation, in particular at the higher levels of the hierarchy. Some edges in the Amazon book directory also impose “is-a” links; the most obvious ones are the edges from the root category.

Notice that, in the case of edges leading to the same resource, the “general intersection” relation must hold for all the categories in all the different paths. This fact can be used to improve the classification representation: either by trying to prohibit this situation (if the goal is to classify a resource unambiguously, as happens in a library classification such as DDC) or by encouraging it (if the goal is to improve the recall of relevant resources, as happens in a web catalogue such as Amazon).

Let us now consider the characteristics of labels. As per Definition 1, the concept of a specific node is described by a label expressed in words and, possibly, separators between them. Node labels possess an interesting structure, relevant to formal classification hierarchies:

– Natural language labels are composed of atomic elements, namely words. These words can be analyzed in order to find all their possible basic forms and their possibly multiple senses, i.e., the ways in which the word can be interpreted. In this paper, we use WordNet [12] to retrieve word senses⁹; in practice, however, a different thesaurus can be used. For example, the word “Java” in the label “Java Language” in Figure 2 possesses different equivalent forms (e.g., Java, java) and three different senses:

1. an island in Indonesia;

2. a beverage consisting of an infusion of ground coffee beans; and

3. an object oriented programming language.

⁹ We may change the actual senses of a word from WordNet for the sake of presentation.

– Words are combined to build complex concepts out of the atomic elements. Consider for example the labels Computers and Internet and Java Language in Figure 2. The combination of natural language atomic elements is used by the classifier to aggregate atomic concepts (as in Computers and Internet) or to disambiguate them (as in Java Language, where the senses of the word Java that denote “an island in Indonesia” and “a type of coffee” can be discarded, while the correct sense of “an object oriented programming language” is retained).

– Natural language labels make use of the structure of the classification hierarchy to improve the semantic interpretation associated with a given node. We call this property parental contextuality of a node. For instance, the senses of words composing the labels of different nodes in a hierarchy path can be incompatible; thus the correct meaning of a particular word in a specific label can be disambiguated by considering the senses of the words in some labels along the path. For example, in the path Java Language → Java Beans, the plausible but wrong sense of Java Beans as “a particular type of coffee bean” can be pruned by the classifier by taking into account the meaning of the parent node's label, Java Language.

Let us see how we can convert classiﬁcations into a new structure, which we

call a Formal Classiﬁcation (FC), more amenable to automated processing:

Definition 2 (Formal Classification) A Formal Classification is a rooted tree described by a triple $H^F = \langle C, E, l^F \rangle$, where $C$ is a finite set of nodes, $E$ is a set of edges on $C$, and $l^F$ is a function from $C$ to a set $L^F$ of labels expressed in a Propositional Description Logic language $L^C$.

As can be seen, the key step is that in FCs natural language labels are replaced by labels written in a formal logical language. In the following we will call $L^C$ the Concept Language. We use a Propositional Description Logic language for several reasons. First, we move from an ambiguous language to a formal language with clear semantics. Second, given its set-theoretic interpretation, $L^C$ “maps” naturally to the real world semantics. For instance, the atomic proposition $p = computer$ denotes “the set of machines capable of performing calculations automatically”. Third, natural language labels are usually short expressions or phrases having a simple syntactic structure. Thus no sophisticated natural language processing and knowledge representation techniques are required: a phrase can often be converted into a formula in $L^C$ with little or no loss of meaning. Fourth, a formula in $L^C$ can be converted into an equivalent formula in a propositional logic language with boolean semantics. Thus a problem expressed in $L^C$ can be converted into a propositional satisfiability problem¹⁰.
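The fourth point can be illustrated with a minimal sketch. It assumes the standard reduction: a concept A is more specific than a concept B exactly when A together with the negation of B (and any background axioms) is unsatisfiable. The tuple encoding of formulas, the function names, and the background axiom are hypothetical, and the brute-force truth table stands in for a real SAT solver.

```python
from itertools import product


def atoms(f):
    """Collect the atomic propositions of a formula given as a string or a nested tuple."""
    if isinstance(f, str):
        return {f}
    return set().union(*(atoms(arg) for arg in f[1:]))


def holds(f, valuation):
    """Evaluate a formula under a truth assignment (dict: atom -> bool)."""
    if isinstance(f, str):
        return valuation[f]
    op, *args = f
    if op == "not":
        return not holds(args[0], valuation)
    if op == "and":
        return all(holds(a, valuation) for a in args)
    if op == "or":
        return any(holds(a, valuation) for a in args)
    raise ValueError(f"unknown operator: {op}")


def satisfiable(f):
    """Brute-force satisfiability check over all truth assignments."""
    names = sorted(atoms(f))
    return any(
        holds(f, dict(zip(names, bits)))
        for bits in product([False, True], repeat=len(names))
    )


def subsumed_by(a, b, axioms=()):
    """a is more specific than b iff (axioms AND a AND NOT b) is unsatisfiable."""
    return not satisfiable(("and", *axioms, a, ("not", b)))


# A conjunction is always more specific than each of its conjuncts.
print(subsumed_by(("and", "Java3", "Programming2"), "Java3"))

# Hypothetical background axiom: every Java3 object is a Programming2 object.
axiom = ("or", ("not", "Java3"), "Programming2")
print(subsumed_by("Java3", "Programming2", [axiom]))
```

The truth table is exponential in the number of atoms, so this only illustrates the reduction; any off-the-shelf SAT procedure could take its place.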

Apart from the atomic propositions, the language $L^C$ includes logical operators, such as conjunction (denoted by $\sqcap$), disjunction (denoted by $\sqcup$), and negation ($\neg$); as well as comparison operators: more general ($\sqsupseteq$), more specific ($\sqsubseteq$), and equivalence ($\equiv$). In the following we will also say that $A$ subsumes $B$ if $A \sqsupseteq B$, and that $A$ is subsumed by $B$ if $A \sqsubseteq B$. The interpretation of the operators is the standard set-theoretic interpretation.
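Under the set-theoretic interpretation, each concept denotes a set of objects, so the operators map onto ordinary set operations. The sketch below uses Python sets over a toy domain; the domain elements and the extensions of Computer and Internet are invented purely for illustration.

```python
# Hypothetical toy domain: every object name below is illustrative, not from the paper.
universe = {"laptop", "router", "web_server", "cookbook"}
computer = {"laptop", "web_server"}   # interpretation of the atomic concept Computer
internet = {"router", "web_server"}   # interpretation of the atomic concept Internet

conjunction = computer & internet     # Computer AND Internet: set intersection
disjunction = computer | internet     # Computer OR Internet: set union
negation = universe - computer        # NOT Computer: complement w.r.t. the domain


def more_specific(a, b):
    """A is more specific than B: every object in A is also in B."""
    return a <= b


def more_general(a, b):
    """A is more general than B (A subsumes B): B is more specific than A."""
    return b <= a


def equivalent(a, b):
    """A and B denote the same set of objects."""
    return a == b


print(more_specific(conjunction, computer))   # a conjunction narrows its conjuncts
print(more_general(disjunction, computer))    # a disjunction widens its disjuncts
```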

We build FCs out of classifications by translating, using natural language processing techniques, natural language labels, $l_i$'s, into concept language labels, $l^F_i$'s. For lack of space we do not describe here how we perform this step. The interested reader is referred to [10]. As an example, recall the classification shown in Figure 2. For instance, the label Java beans of node $n_8$ is translated into the following expression:

$l^F_8 = (Java_1 \sqcup Java_2 \sqcup Java_3) \sqcap (Bean_1 \sqcup Bean_2)$   (1)

where $Java_1$ denotes the Java island, $Java_2$ is a brewed coffee, $Java_3$ is the object oriented programming language Java, $Bean_1$ is a kind of seed, and $Bean_2$ is a Java technology related term. The disjunction $\sqcup$ is used to codify the fact that Java and Bean may mean different things. The conjunction $\sqcap$ is used to codify that the meaning of Java beans must take into account what Java means and what Bean means.

As mentioned above, some senses of a word in a label may be incompatible with the senses of the other words in the label, and, therefore, these senses can be discarded. A way to check this in $L^C$ is to convert a label into Disjunctive Normal Form (DNF). A formula in DNF is a disjunction of conjunctions of atomic formulas or negations of atomic formulas, where each block of conjunctions is called a clause [11]. Below is the result of converting Formula 1 into DNF:

$l^F_8 = (Bean_1 \sqcap Java_1) \sqcup (Bean_1 \sqcap Java_2) \sqcup (Bean_1 \sqcap Java_3) \sqcup (Bean_2 \sqcap Java_1) \sqcup (Bean_2 \sqcap Java_2) \sqcup (Bean_2 \sqcap Java_3)$   (2)

The first clause in Formula 2, i.e., $(Bean_1 \sqcap Java_1)$, can be discarded, as there is nothing in common between seeds and the island. The second clause, instead, is meaningful: it denotes the coffee seeds. Analogously, clauses 3, 4 and 5 are discarded and clause 6 is preserved. The final formula for the label of node $n_8$ therefore becomes:

$l^F_8 = (Bean_1 \sqcap Java_2) \sqcup (Bean_2 \sqcap Java_3)$   (3)

Note that sense $Java_1$ is pruned away in the final formula, as it has nothing to do with any sense of the word “bean”. Analogously, all the other labels in the classification shown in Figure 2 are translated into expressions in $L^C$ and further simplified. At this point, the “converted” Classification represents an FC.

¹⁰ For translation rules from a Propositional Description Logic to a Propositional Logic, see [2, 5].

Note that each clause in DNF represents a distinct meaning encoded into the label. This fact allows both agents and classifiers to operate on meanings of labels, and not on meanings of single words.
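The passage from Formula 1 to Formula 3 can be sketched in a few lines. This is an illustration under stated assumptions rather than the translation procedure of [10]: the `compatible` set is hypothetical hand-written background knowledge standing in for whatever a thesaurus such as WordNet would provide.

```python
from itertools import product

# Senses of each word in the label "Java beans" (Formula 1).
senses = {
    "Java": ["Java1", "Java2", "Java3"],   # island, coffee, programming language
    "Bean": ["Bean1", "Bean2"],            # seed, Java technology term
}

# Hypothetical background knowledge: sense pairs that have something in common.
compatible = {("Bean1", "Java2"), ("Bean2", "Java3")}

# Formula 2: distributing the conjunction over the disjunctions yields
# one DNF clause per choice of a sense for each word.
dnf = list(product(senses["Bean"], senses["Java"]))

# Formula 3: discard the clauses whose senses are incompatible.
pruned = [clause for clause in dnf if clause in compatible]

for clause in pruned:
    print(" AND ".join(clause))
```

Each surviving clause corresponds to one distinct meaning of the label, which is what later steps operate on.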

4 Normalized Formal Classiﬁcations

As discussed in Section 3, in classifications, child nodes are considered in the context of their parent nodes. We formalize this notion of parental context in an FC following the definition of concept at a node from [5]:

Definition 3 (Concept at a node) Let $H^F$ be an FC and $n_i$ be a node of $H^F$. Then, the concept at node $n_i$, written $C_i$, is its label $l^F_i$ if $n_i$ is the root of $H^F$, and, otherwise, it is the conjunction of the label of $n_i$ and the concept at node $n_j$, which is the parent of $n_i$. In formulas:

$C_i = \begin{cases} l^F_i & \text{if } n_i \text{ is the root of } H^F \\ l^F_i \sqcap C_j & \text{if } n_i \text{ is a non-root node of } H^F, \text{ where } n_j \text{ is the parent of } n_i \end{cases}$

Applying Definition 3 recursively, we can compute the concept at any non-root node $n_i$ as the conjunction of the labels of all the nodes on the path from the root of $H^F$ to $n_i$:

$C_i = l^F_1 \sqcap l^F_2 \sqcap \ldots \sqcap l^F_i$   (4)

The notion of concept at a node explicitly captures the classification semantics. Namely, the interpretation of the concept at a node is the set of objects that the node and all its ancestors have in common (see Figure 3). From the classification point of view, the concept at a node defines what (class of) documents can be classified in this node.
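Definition 3 and Formula 4 describe the same computation, once recursively and once as a walk along the path from the root. The sketch below shows both on the Amazon path ending at node 8; the dictionaries `parent` and `label` and both function names are hypothetical, and labels are kept as plain strings (with `*` marking the disjunction of all senses of a word) rather than parsed formulas.

```python
# Child -> parent along the path Subjects -> ... -> Java Beans (node numbers as in the text).
parent = {3: 1, 5: 3, 7: 5, 8: 7}

# Formal labels written as strings; "X*" stands for the disjunction of all senses of X.
label = {
    1: "Subject*",
    3: "(Computer* OR Internet*)",
    5: "Programming*",
    7: "(Java* AND Language*)",
    8: "(Java* AND Bean*)",
}


def concept_at(node):
    """Definition 3: the root's label, else the node's label conjoined with the parent's concept."""
    if node not in parent:
        return [label[node]]
    return concept_at(parent[node]) + [label[node]]


def concept_by_path(node):
    """Formula 4: conjunction of the labels on the path from the root to the node."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return [label[n] for n in reversed(path)]


# Both functions return the list of conjuncts, root first.
print(" AND ".join(concept_at(8)))
```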

The definition of concept at a node possesses a number of important properties relevant to classification:

Property C.1: each $C_i$ codifies both the label of $n_i$ and the path from the root to $n_i$. There are two important consequences of this: first, it allows us to prune away irrelevant senses along the path; and second, if converted to DNF, $C_i$ represents the union of all the possible distinct meanings of a node in the FC's tree.

Recall the Amazon running example. According to Formula 4, the concept at node $n_8$ is:

$C_8 = (Subject^*) \sqcap (Computer^* \sqcup Internet^*) \sqcap (Programming^*) \sqcap (Java^* \sqcap Language^*) \sqcap (Java^* \sqcap Bean^*)$ ¹¹

¹¹ We write $X^*$ to denote the disjunction of all the senses of $X$.

The plausible but wrong sense $(Bean_1 \sqcap Java_2)$, “a particular type of coffee bean” (the first clause in Formula 3), can be pruned by converting the concept at node $n_8$ into DNF, which contains the clause $(Language_1 \sqcap Java_2 \sqcap Bean_1)$, and checking it as a propositional satisfiability problem: since the meaning of $Language_1$ is “incompatible” with $Java_2$, the expression results in an inconsistency.
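The pruning step can be sketched as a consistency check on DNF clauses. The sketch assumes that background knowledge is given as pairs of mutually exclusive senses; under that assumption a conjunction of atoms is satisfiable exactly when it contains no such pair. The `incompatible` list and the second clause are hypothetical.

```python
# Hypothetical background knowledge: pairs of senses that cannot hold together.
incompatible = [
    {"Language1", "Java1"},   # a language is not an island
    {"Language1", "Java2"},   # a language is not a beverage
]


def consistent(clause):
    """A conjunction of atoms is satisfiable iff it contains no incompatible pair."""
    return not any(pair <= clause for pair in incompatible)


# Two clauses from the DNF of the concept at node 8 (atoms only, other conjuncts omitted).
clauses = [
    {"Language1", "Java2", "Bean1"},   # "coffee bean" reading
    {"Language1", "Java3", "Bean2"},   # "Java technology" reading
]

surviving = [c for c in clauses if consistent(c)]
print(surviving)
```

With richer axioms (for example implications between senses) the pairwise test no longer suffices and a general satisfiability procedure is needed.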

Property C.2: each $C_i$ has a normal form. In fact, it is always possible to transform each $C_i$ into Conjunctive Normal Form (CNF), namely a conjunction of disjunctions of atomic formulas or negations of atomic formulas [11]. Therefore $C_i$ codifies in one logical expression all the possible ways of conveying the same concept associated with a node.
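As a small worked instance of such a transformation, the label of Formula 3 can be brought into CNF by distributing disjunction over conjunction. The derivation below is an illustration added here, writing $\sqcap$ for conjunction and $\sqcup$ for disjunction; it is not taken from the paper.

```latex
\begin{align*}
l^F_8 &= (Bean_1 \sqcap Java_2) \sqcup (Bean_2 \sqcap Java_3) \\
      &\equiv (Bean_1 \sqcup Bean_2) \sqcap (Bean_1 \sqcup Java_3)
         \sqcap (Java_2 \sqcup Bean_2) \sqcap (Java_2 \sqcup Java_3)
\end{align*}
```

Each of the four resulting disjunctions pairs one conjunct from the first DNF clause with one conjunct from the second.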

We use the notion of concept at a node to define a further new structure, which we call Normalized Formal Classification (NFC):

Definition 4 (Normalized Formal Classification) A Normalized Formal Classification is a rooted tree described by a triple $H^N = \langle C, E, l^N \rangle$, where $C$ is a finite set of nodes, $E$ is a set of edges on $C$, and $l^N$ is a function from $C$ to a set $L^N$ of concepts at nodes.

The proposed NFC also possesses a number of important properties relevant to classification:

Property NFC.1: when all $C_i$ are expressed in CNF (see Property C.2), all the nodes expressing semantically equivalent concepts will collapse to the same CNF expression. Even when two computed concepts are not equivalent, the comparison of the two CNF expressions will provide enhanced similarity analysis capability to support both classification and query-answering tasks. Following our example, the normalized form of the concept at node $n_8$ with the path (in natural language):

Subjects → Computers and Internet → Programming → Java Language → Java Beans

will be equivalent, for instance, to the concept associated with a path like:

Topic → Computer → Internet → Programming → Languages → Java → Java Beans

and similar (i.e., more general or more specific) to, say:

Discipline → Computer Science → Programming languages → Java → J2EE → Java Beans

Property NFC.2: any NFC is a taxonomy, in the sense that for any non-root node $n_i$ and its concept $C_i$, the concept $C_i$ is always subsumed by $C_j$, where $n_j$ is the parent node of $n_i$. We claim that NFCs are the “correct” translations of classifications into ontological taxonomies, as they codify the intended semantics/use of classifications. Notice that, under this assumption, no expressive ontological languages are needed in order to capture the classification semantics, and a Propositional Description Logic is sufficient. In this respect our work differs substantially from the work described in [10].

Consider, in our running Amazon example, the following path in the natural language classification:

Subjects → Computers and Internet → Programming

As described in Section 3, this path contains a link expressing the “general intersection” relation, namely the link Computers and Internet → Programming (see Figure 4). The same relation is maintained when we move to FCs. In our notation: $l^F_1 = Subject^*$, $l^F_3 = (Computer^* \sqcup Internet^*)$, $l^F_5 = Programming^*$. But when we move to the NFC for the given example, our elements become: $C_1 = l^F_1$; $C_3 = l^F_1 \sqcap l^F_3$; $C_5 = l^F_1 \sqcap l^F_3 \sqcap l^F_5$; and the only relation holding between successive elements is subsumption.
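The reason only subsumption remains can be spelled out in one step, using the fact that a conjunction is always more specific than each of its conjuncts ($A \sqcap B \sqsubseteq A$). The derivation below is a worked restatement of the example, with $\sqcap$ for conjunction and $\sqsubseteq$ for “more specific”.

```latex
\begin{align*}
C_5 &= l^F_1 \sqcap l^F_3 \sqcap l^F_5 = C_3 \sqcap l^F_5 \sqsubseteq C_3, \\
C_3 &= l^F_1 \sqcap l^F_3 = C_1 \sqcap l^F_3 \sqsubseteq C_1 .
\end{align*}
```

The same argument applies to any non-root node, since by Definition 3 its concept is its own label conjoined with the concept of its parent.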

The above properties of both $C_i$ and NFC have interesting implications for classification and query answering, as described in the next section.

5 Document classiﬁcation and query answering

We assume that each document $d$ is assigned an expression in $L^C$, which we call the document concept, written $C^d$. The assignment of concepts to documents is done in two steps: first, a set of the document's keywords is retrieved using text mining techniques (see, for example, [14]); the keywords are then converted into a corresponding concept using the same techniques used to translate natural language labels into concept language labels (see Section 3).

There exist a number of approaches to classifying a document. In one such approach a document is classified in only one node (as in DDC); in another it may be classified under several nodes (as in Amazon). However, in most cases, the general rule is to classify a document in the node or nodes that most specifically describe the document, i.e., to follow the “Get Specific” criterion discussed in Section 2. In our approach, we allow a document to be classified in more than one node, and we also follow the “Get Specific” criterion. We express these criteria formally as follows:

Definition 5 (Classification Set) Let $H^N$ be an NFC, $d$ be a document, and $C^d$ be the concept of $d$. Then, the classification set for $d$ in $H^N$, written $Cl^d$, is a set of nodes $\{n_i\}$ such that for any node $n_i \in Cl^d$ the following two conditions hold:

1. the concept at node $n_i$ is more general than $C^d$, i.e., $C^d \sqsubseteq C_i$; and

2. there is no node $n_j$ ($j \neq i$) whose concept at node is more specific than $C_i$ and more general than $C^d$.

Document $d$ is classified in all the nodes from the set $Cl^d$ in Definition 5.
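Definition 5 can be sketched as a two-pass filter: first keep the nodes whose concepts are more general than the document concept, then drop every node for which a strictly more specific candidate exists. The sketch makes several simplifying assumptions that are not in the paper: concepts are plain conjunctions of atoms (so subsumption reduces to set inclusion), “more specific” in condition 2 is read as strictly more specific, and the atom names and the toy NFC are hypothetical.

```python
def subsumed_by(a, b):
    """a is more specific than b, for conjunctions of atoms: b's conjuncts all occur in a."""
    return b <= a


def classification_set(doc_concept, concepts):
    """Nodes that are more general than the document and have no more specific competitor."""
    # Condition 1: the concept at the node is more general than the document concept.
    candidates = [n for n, c in concepts.items() if subsumed_by(doc_concept, c)]
    # Condition 2: no other candidate is strictly more specific than this node.
    return {
        n for n in candidates
        if not any(
            m != n
            and subsumed_by(concepts[m], concepts[n])
            and not subsumed_by(concepts[n], concepts[m])
            for m in candidates
        )
    }


# Toy NFC along one path; each concept is the set of atoms conjoined at that node.
concepts = {
    1: frozenset(),                                             # root: no constraint
    3: frozenset({"computer"}),
    5: frozenset({"computer", "programming"}),
    7: frozenset({"computer", "programming", "java"}),
    8: frozenset({"computer", "programming", "java", "bean"}),
}

# Hypothetical concept of a book on Java programming.
d1 = frozenset({"computer", "programming", "java"})
print(classification_set(d1, concepts))
```

Because the candidate set already satisfies condition 1, the second pass only needs to compare candidates with each other.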

Suppose we are given two documents: a book on Java programming ($d_1$) and an article on high tech entrepreneurship ($d_2$). Suppose now that these documents are assigned the following concepts: $C^{d_1} = Java_3 \sqcap Programming_2$, and $C^{d_2} = \mathit{High\ tech}_1 \sqcap Venture_3$, where $Java_3$ is the programming language, $Programming_2$ is computer programming, $\mathit{High\ tech}_1$ is “highly advanced technological development”, and $Venture_3$ is “a commercial undertaking that risks a loss but promises a profit”. Intuitively, $C^{d_1}$ is more specific than the concept at the node labeled Java language in the classification shown in Figure 2. In fact, logical inference confirms the intuition: it is possible to show that the relation $C^{d_1} \sqsubseteq C_7$ holds. It is also possible to show that the second condition of Definition 5 holds for node $n_7$. Thus, document $d_1$ is classified in node $n_7$. Analogously, it can be shown that the classification set for $d_2$ is composed of the single node $n_6$. For lack of space we do not show the full formulas and the proofs of these statements.

Moving to query answering, when a user searches for a document, she defines a set of keywords or a phrase, which is then converted into an expression in $L^C$ using the same techniques discussed in Section 3. We call this expression a query concept, written $C^q$. We define the answer $A^q$ to a query $q$ as the set of documents whose concepts are more specific than the query concept for $q$:

$A^q = \{d \mid C^d \sqsubseteq C^q\}$   (5)

Searching directly on all the documents may become prohibitively expensive, as classifications may contain thousands or even millions of documents. NFCs allow us to identify the maximal set of nodes which contain only answers to a query, which we call the sound classification answer to a query (written $N^q_s$). We compute $N^q_s$ as follows:

$N^q_s = \{n_i \mid C_i \sqsubseteq C^q\}$   (6)

In fact, as $C^d \sqsubseteq C_i$ for any document $d$ classified in any node $n_i \in N^q_s$, and $C_i \sqsubseteq C^q$, then $C^d \sqsubseteq C^q$. Thus, all the documents classified in the set of nodes $N^q_s$ belong to the answer $A^q$ (see Formula 5).

We extend $N^q_s$ by adding nodes which constitute the classification set of a document $d$ whose concept is $C^d = C^q$. We call this set the query classification set, written $Cl^q$, and we compute it following Definition 5. In fact, nodes in $Cl^q$ may contain documents satisfying Formula 5, for instance, documents whose concepts are equivalent to $C^q$.

Suppose, for instance, that a user defines the following query to the Amazon NFC: $C^q = \mathit{Java}_3 \sqcup \mathit{COBOL}_1$, where $\mathit{COBOL}_1$ is "common business-oriented language". It can be shown that $N^q_s = \{n_7, n_8\}$ (see Figure 2 for the Amazon classification). However, this set does not include node $n_5$, which contains the book "Java for COBOL Programmers (2nd Edition)". Node $n_5$ can be identified by computing the query classification set for query $q$, which in fact consists of the single node $n_5$, i.e. $Cl^q = \{n_5\}$. However, $n_5$ may also contain irrelevant documents.

Thus, for any query $q$, a user can compute a sound query answer $A^q_s$ by taking the union of two sets of documents: the set of documents which are classified in the set of nodes $N^q_s$ (computed as $\{d \in n_i \mid n_i \in N^q_s\}$); and the set of documents which are classified in the nodes from the set $Cl^q$ and which satisfy Formula 5 (computed as $\{d \in n_i \mid n_i \in Cl^q,\ C^d \sqsubseteq C^q\}$). We have therefore:

$$A^q_s = \{d \in n_i \mid n_i \in N^q_s\} \cup \{d \in n_i \mid n_i \in Cl^q,\ C^d \sqsubseteq C^q\} \qquad (7)$$

Under the given definition, the answer to a query is not restricted to the documents classified in the nodes whose concepts are the "closest" match to the query. Documents from nodes whose concepts are more specific than the query are also returned. For instance, a result for the above-mentioned query may also contain documents about Java beans.

Note that the structure of an NFC (i.e., the edges) is considered neither during document classification nor during query answering. In fact, given the proposed classification algorithm, the edge information becomes redundant, as it is implicitly encoded in the concepts at the nodes. We say implicitly because there may be more than one way to "reconstruct" an NFC resulting in the same set of concepts at nodes. However, all the possible NFCs are equivalent, in the sense that the same set of documents is classified into exactly the same set of nodes.

The algorithms presented in this section are sound and complete in the document classification part, as Propositional Logic allows for sound and complete reasoning on documents according to Definition 5. The proposed solution for query answering is sound but not complete, as $A^q_s \subseteq A^q$. For lack of space we do not provide evidence of the incompleteness property of the solution.
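The query answering scheme of Formulas 5 to 7 can be sketched on a toy instance. The sketch below assumes that concepts are propositional formulas encoded as Python predicates over a fixed, hypothetical set of atomic concepts, and that subsumption is decided by a truth-table check. The node names, document names and concepts are invented for illustration and do not correspond to Figure 2; the strict reading of "more specific" in Definition 5 is likewise our assumption.

```python
from itertools import product

ATOMS = ["Java", "COBOL", "Pascal", "Beans"]  # hypothetical atomic concepts

def subsumed(c, d):
    """c is subsumed by d: every truth assignment satisfying c satisfies d."""
    for values in product([False, True], repeat=len(ATOMS)):
        v = dict(zip(ATOMS, values))
        if c(v) and not d(v):
            return False
    return True

def classification_set(concept, nodes):
    """Most specific nodes whose concept is more general than `concept` (Definition 5)."""
    general = [n for n in nodes if subsumed(concept, nodes[n])]
    return {
        i for i in general
        if not any(
            j != i and subsumed(nodes[j], nodes[i]) and not subsumed(nodes[i], nodes[j])
            for j in general
        )
    }

def sound_answer(query, nodes, docs):
    """Return the sound classification answer, the query classification set
    and the sound query answer, following Formulas 6 and 7."""
    # Classify every document into the nodes of its classification set.
    placed = {n: set() for n in nodes}
    for d, cd in docs.items():
        for n in classification_set(cd, nodes):
            placed[n].add(d)
    n_s = {n for n in nodes if subsumed(nodes[n], query)}   # Formula 6
    cl_q = classification_set(query, nodes)                 # query classification set
    answer = {d for n in n_s for d in placed[n]}            # first set of Formula 7
    answer |= {d for n in cl_q for d in placed[n]           # second set of Formula 7
               if subsumed(docs[d], query)}
    return n_s, cl_q, answer

nodes = {
    "n_top": lambda v: True,
    "n_java": lambda v: v["Java"],
    "n_mixed": lambda v: v["Java"] or v["Pascal"],
}
docs = {
    "d_beans": lambda v: v["Java"] and v["Beans"],
    "d_both": lambda v: v["Java"] or v["COBOL"],
    "d_other": lambda v: v["Beans"],
    "d_missed": lambda v: v["COBOL"] and v["Pascal"],
}
query = lambda v: v["Java"] or v["COBOL"]

n_s, cl_q, a_s = sound_answer(query, nodes, docs)
a_full = {d for d, cd in docs.items() if subsumed(cd, query)}  # Formula 5

print(sorted(n_s), sorted(cl_q))   # ['n_java'] ['n_top']
print(sorted(a_s))                 # ['d_beans', 'd_both']
print(sorted(a_full - a_s))        # ['d_missed']
```

In this toy instance the document `d_missed` satisfies Formula 5 but is classified in `n_mixed`, a node that belongs neither to the sound classification answer nor to the query classification set, so the sound query answer is a proper subset of the full answer. This is only an illustration under the stated assumptions, not a proof of the incompleteness property.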

6 Related Work

In our work we adopt the notion of the concept at node as first introduced in [4] and further elaborated in [5]. Moreover, the notion of label of a node in an FC semantically corresponds to the notion of the concept of a label introduced in [5]. In [5] these notions play the key role in the identification of semantic mappings between nodes of two schemas. In this paper, these are the key notions needed to define NFCs.

This work, as well as the work in [4, 5] mentioned above, is crucially related to and depends on the work described in [2, 10]. In particular, in [2], the authors introduce, for the first time, the idea that in classifications natural language labels should be translated into logical formulas, while in [10] the authors provide a detailed account of how to perform this translation process. The work in [4, 5] improves on the work in [2, 10] by understanding the crucial role that concepts at nodes have in matching heterogeneous classifications and how this leads to a completely new way to do matching. As a matter of fact, the work in [4] classifies the work in [2, 4, 5, 10] as semantic matching and distinguishes it from all the previous work, classified under the heading syntactic matching. This paper, for the first time, recognizes the crucial role that the ideas introduced in [2, 4, 5, 10] have in the construction of a new theory of classification, and in introducing the key notion of FC.

A lot of work in information theory, and more precisely on formal concept analysis (see for instance [16]), has concentrated on the study of concept hierarchies. NFCs are what in formal concept analysis are called concept hierarchies with no attributes. The work in this paper can be considered as a first step towards providing a computational theory of how to transform the "usual" natural language classifications into concept hierarchies. Remember that concept hierarchies are ontologies which are trees where parent nodes subsume their child nodes.

The classification and query answering algorithms proposed in this paper are similar to what in the Description Logic (DL) community is called realization and retrieval, respectively. The fundamental difference between the two approaches is that in DL the underlying structure for classification is not predefined by the user, but is built bottom-up from atomic concepts by computing the subsumption partial ordering. Interested readers are referred to [7], where the authors propose sound and complete algorithms for realization and retrieval.

In Computer Science, the term classification is primarily seen as the process of arranging a set of objects (e.g., documents) into categories or classes. There exist a number of different approaches which try to build classifications bottom-up, by analyzing the contents of documents. These approaches can be grouped in two main categories: supervised classification and unsupervised classification. In the former case, a small set of training examples needs to be prepopulated into the categories in order to allow the system to automatically classify a larger set of objects (see, for example, [3, 13]). The latter approach uses various machine learning techniques to classify objects, for instance, data clustering [8]. There exist some approaches that apply (mostly) supervised classification techniques to the problem of classifying documents into hierarchies [9, 15]. The classifications built following our approach are better and more natural than those built following these approaches. They are in fact constructed top-down, as chosen by the user, and not constructed bottom-up, as they come out of the document analysis. Notice how in this latter case the user has little or no control over the language used in classifications. Our approach has the potential, in principle, to allow for the automatic classification of the (say) Yahoo! documents into the Yahoo! directories. Some of our current work is aimed at testing the feasibility of our approach with very large sets of documents.

7 Conclusions

In this paper we have introduced the notion of Formal Classification, namely of a classification where labels are written in a propositional concept language. Formal Classifications have many advantages over standard classifications, all deriving from the fact that formal language formulas can be reasoned about far more easily than natural language sentences. In this paper we have highlighted how this can be done to perform query answering and document classification. However, much more can be done. Our future work includes the development of a sound and complete query answering algorithm, as well as the development and evaluation of tools that implement the theoretical framework presented in this paper. There are two tools of particular importance, namely the document classifier and query answering tools, which will provide the functionality described in Section 5. The performance of the tools will then be compared to the performance of the most advanced heuristics-based approaches. Yet another line of research will be the development of a theoretical framework and algorithms allowing for the interoperability between NFCs. The latter particularly includes distributed query answering and multiple document classification under sound and complete semantics.

References

1. Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, and Peter Patel-Schneider. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, 2003.

2. P. Bouquet, L. Serafini, and S. Zanobini. Semantic coordination: a new approach and an application. In Proc. of the 2nd International Semantic Web Conference (ISWC'03). Sanibel Island, Florida, USA, October 2003.

3. G. Adami, P. Avesani, and D. Sona. Clustering documents in a web directory. In Proceedings of Workshop on Internet Data management (WIDM-03), 2003.

4. F. Giunchiglia and P. Shvaiko. Semantic matching. "Ontologies and Distributed Systems" workshop, IJCAI, 2003.

5. F. Giunchiglia, P. Shvaiko, and M. Yatskevich. S-match: An algorithm and an implementation of semantic matching. In Proceedings of ESWS'04, 2004.

6. A. D. Gordon. Classification. Monographs on Statistics and Applied Probability. Chapman-Hall/CRC, Second edition, 1999.

7. Ian Horrocks, Lei Li, Daniele Turi, and Sean Bechhofer. The instance store: DL reasoning with large numbers of individuals. In Proc. of the 2004 Description Logic Workshop (DL 2004), pages 31–40, 2004.

8. A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.

9. Daphne Koller and Mehran Sahami. Hierarchically classifying documents using very few words. In Douglas H. Fisher, editor, Proceedings of ICML-97, 14th International Conference on Machine Learning, pages 170–178, Nashville, US, 1997. Morgan Kaufmann Publishers, San Francisco, US.

10. Bernardo Magnini, Luciano Serafini, and Manuela Speranza. Making explicit the semantics hidden in schema models. In Proceedings of the Workshop on Human Language Technology for the Semantic Web and Web Services, held at ISWC-2003, Sanibel Island, Florida, October 2003.

11. E. Mendelson. Introduction to Mathematical Logic. Chapman-Hall, 4th ed. London, 1997.

12. George Miller. WordNet: An electronic Lexical Database. MIT Press, 1998.

13. Kamal Nigam, Andrew K. McCallum, Sebastian Thrun, and Tom M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.

14. Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.

15. Aixin Sun and Ee-Peng Lim. Hierarchical text classification and evaluation. In ICDM, pages 521–528, 2001.

16. Rudolf Wille. Concept lattices and conceptual knowledge systems. Computers and Mathematics with Applications, 23:493–515, 1992.
