Integrating Data from Maps on the World-Wide Web
Eliyahu Safra1, Yaron Kanza⋆2, Yehoshua Sagiv⋆⋆3, and Yerach Doytsher1
1Department of Transportation and Geo-Information, Technion, Haifa, Israel
{safra, doytsher}@technion.ac.il
2Department of Computer Science, University of Toronto, Toronto, Canada
yaron@cs.toronto.edu
3School of Engineering and Computer Science, The Hebrew University, Jerusalem, Israel
sagiv@cs.huji.ac.il
Abstract. A substantial amount of data about geographical entities is available
on the World-Wide Web, in the form of digital maps. This paper investigates the
integration of such data. A three-step integration process is presented. First, geo-
graphical objects are retrieved from maps on the Web. Secondly, pairs of objects
that represent the same real-world entity, in different maps, are discovered and
the information about them is combined. Finally, selected objects are presented
to the user. The proposed process is efficient, accurate (i.e., the discovery of cor-
responding objects has high recall and precision) and it can be applied to any pair
of digital maps, without requiring the existence of specific attributes. For the step
of discovering corresponding objects, three new algorithms are presented. These
algorithms modify existing methods that use only the locations of geographical
objects, so that information additional to locations will be utilized in the process.
The three algorithms are compared using experiments on datasets with varying
levels of completeness and accuracy. It is shown that when used correctly, ad-
ditional information can improve the accuracy of location-based methods even
when the data is not complete or not entirely accurate.
1 Introduction
Many maps are available on the World-Wide Web, providing information on geograph-
ical entities. The information consists of both spatial and non-spatial properties of the
entities. Examples of spatial properties are location and shape of an entity. Examples
of non-spatial properties are name and address. The goal of integrating two maps is to
enable applications and users to easily access the properties that are available in either
one of those maps. Another reason for integration is that some geographical entities
may appear in only one of the maps. Integration increases the likelihood that for all the
relevant entities, in a specified geographical area, objects that represent these entities
are presented to the user.
An integration of two maps consists of the following three steps: extracting geo-
graphical objects from the maps, discovering pairs of objects that represent the same
real-world entity in different sources (such objects are called corresponding objects)
⋆ This author was supported by an NSERC grant.
⋆⋆ This author was supported by The Israel Science Foundation (Grant 893/05).
and presenting the result to the user. This paper deals mainly with the second step of dis-
covering corresponding objects. We use the term matching algorithm for an algorithm
that discovers corresponding objects in two given datasets of geographical objects.
Methods for integrating data from the Web, and especially matching algorithms,
should be able to cope with the following characteristics of the Web.
– Data on the Web is heterogeneous. This means that the same piece of information
can have different forms in different sources. For example, in different sources, the
name of a geographical entity can have different spellings or can be written in dif-
ferent languages. This makes it difficult for integration methods to use properties,
such as names, for discovering corresponding objects. Another aspect of hetero-
geneity is incompleteness. Some attributes may not be available in some sources or
not specified for some objects.
– Data may change frequently. For example, maps that contain hotels may also in-
clude reviews that are regularly added and updated by people who have stayed
in those hotels. In such cases, the integration should be performed in real time,
i.e., when the user sends her request for information. Otherwise, the integrated data
will not reflect the most recent changes in the sources. Consequently, an integration
method for data on the Web must be efficient, especially if the method is used in a
Web service that handles many requests concurrently.
– Data on the Web can be incorrect or inaccurate. Hence, on one hand, integration
methods should rely mostly on object properties that are relatively accurate. On
the other hand, this justifies using, in Web applications, approximation matching
algorithms, i.e., highly (but not completely) accurate algorithms for discovering
corresponding objects.
For the above reasons, in this paper we consider techniques that start with
location-based matching algorithms and improve them. Relying primarily on locations
has the following three advantages. First, locations are always available for spatial ob-
jects and their degree of accuracy can be determined relatively easily. Hence, location-
based matching algorithms can be applied to objects from any pair of maps. Second,
location-based methods are suitable for integration of heterogeneous data, since it is
easy to compare a pair of locations even when they are stored or measured in different
ways. Third, there exist efficient location-based matching algorithms.
Location-based matching algorithms that are both efficient and effective were pre-
sented in the past [2,3]. These algorithms only use locations for finding corresponding
objects. Yet, in many cases, the accuracy of the integration can be improved significantly
by using attributes of the integrated objects in addition to locations. This is especially
important when dealing with data from the Web, where locations may be inaccurate.
In this paper, we explain how to use properties of integrated objects to increase the
effectiveness of location-based matching algorithms.
The main contributions of this paper are as follows. First, a complete process of
integrating data from maps on the Web is presented. This process is efficient and gen-
eral, in the sense that it can be applied to any pair of maps. Secondly, we show how,
in addition to locations, attributes of the objects can be used in the integration process.
Specifically, we present three new matching algorithms that use locations as well as
additional information. Thirdly, we describe the results of thorough experiments, on
datasets with different levels of accuracy and completeness, showing that additional
information can improve the results of location-based matching algorithms, when that
information is used appropriately.
The structure of the paper is as follows. In Section 2 we present our methods using a
real-world example of integrating maps showing hotels in the Soho area of Manhattan,
New York. We present our three new methods in Section 3. In Section 4, we provide
the results of experiments we conducted on both real-world data and synthetically gen-
erated data. Also, we compare our methods based on the experimental results. Finally,
in Section 5, we discuss related work and conclude.
2 The Integration Process
We start by presenting our approach to integration of data from maps on the Web. We
do that using an example showing integration of information about hotels in the Soho
area of Manhattan, New York. The data sources we used are Google Earth4 and Yahoo
Maps5. Google Earth is a service that provides a raster image of almost any part of the Earth.
On top of the raster image it shows information such as roads, hotels, and restaurants.
In our example we are interested in information about hotels. For hotels, Google Earth
provides their names. The names are links that lead to additional information, e.g., by
following a link the address of the hotel is provided. A result of a search in Google
Earth for hotels in Soho is depicted in Fig. 1.
Yahoo Maps provides road maps for some major cities in the world. As in Google
Earth, maps include tourist information; however, in Yahoo, hotel names are not pre-
sented on the maps. Instead, a hotel is shown using an icon of a yellow square containing
a red circle, in the location of the hotel. The name of the hotel and additional informa-
tion such as the rank (i.e., number of stars) and price are available for one hotel at a
time. Two possible reasons for not writing hotel names on the map are (1) making the
presentation of the map simpler and easier to read (cartographic reasons), and (2) re-
stricting the information released per user request, so that applications will not be
able to retrieve all the data from Yahoo to their local database (commercial reasons). A
result of a search in Yahoo Maps for hotels in Soho is depicted in Fig. 2.
It may seem a good solution to use, in the hotel scenario, a matching algorithm that
considers as corresponding objects pairs of hotels that have the same name. However,
because names of hotels are not presented on maps from Yahoo, a matching based
on names is problematic. Two other difficulties in using hotel names in a matching
algorithm are the uncertainty in deciding whether two names refer to the same hotel
and the presence of errors in the data. In our case, uncertainty is due to the existence
of several hotels with similar names in the area we consider. For instance, consider the
following hotel names: "Grand Hotel", "Soho Grand Hotel" and "Tribeca Grand Hotel".
Are these the names of three different hotels or of only two different hotels? Another
case of uncertainty is when a hotel has more than one name. In the Soho area, the hotel
named "Howard Johnson Express Inn" according to Google Earth, is named "Metro
Three Hotel Llc" in Yahoo Maps, and indeed these are two names of the same hotel.
4 http://earth.google.com
5 http://maps.yahoo.com
Fig. 1. A map from Google Earth. Fig. 2. A map from Yahoo Maps.
In this work we propose the following three-step integration process. (1) Retrieve
the maps, extract relevant objects from the maps and compute the location of the objects.
(2) Apply a matching algorithm for finding pairs of corresponding objects. (3) Display
objects to the user (or return them as a dataset) where each pair of corresponding objects
is represented by a single object. Objects that do not belong to any pair of corresponding
objects may also be presented.
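To make the three steps concrete, here is a minimal Python sketch (our illustration, not code from the paper; extract, matcher and merge are placeholders supplied by the caller):

def integrate(map_a, map_b, extract, matcher, merge, keep=lambda obj: True):
    # Step 1: retrieve each map, extract relevant objects and compute their locations.
    A = extract(map_a)
    B = extract(map_b)
    # Step 2: find pairs of corresponding objects and singletons.
    P, S = matcher(A, B)
    # Step 3: represent each pair by a single combined object, optionally filter, and return.
    merged = [merge(a, b) for (a, b) in P] + list(S)
    return [obj for obj in merged if keep(obj)]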
We now illustrate these steps using the Soho-hotels scenario. Initially, a search for
hotels in Soho, New York was made in both Google Earth and Yahoo Maps, and two
result images were retrieved (the images shown in Fig. 1 and Fig. 2). The two images
that were found in the search were oriented using geo-referencing. Then, geographical
objects were generated by digitizing the maps, that is, by identifying in the raster images
icons of hotels and calculating the locations of the hotels based on the geo-referencing.
In this example scenario, hotel names were inserted by a human user. In the future
we expect many maps on the Web to be in formats that computers can easily process
without the need of human intervention. GML (Geographic Markup Language) [1] is
an example of such a format.
The second step was to apply a matching algorithm to the two datasets that were
extracted from the maps. The result of this step consists of pairs of objects that represent
the same hotel, and of singletons representing hotels that appear in only one of the
sources. More details about the matching algorithm will be given in the next section.
The final step of the integration is displaying to the user the pairs and singletons
produced by the matching algorithm. Before providing the results, conditions can be
used for selecting which objects to display. Note that filtering the results at this step
makes it possible to apply conditions that use attributes from both sources.
3 Matching Algorithms
The most involved part of an integration process is the discovery of corresponding ob-
jects, i.e., the matching algorithm. Several matching algorithms that use only the loca-
tion of objects were proposed in the past [2,3]. We now present three new algorithms
that are built upon existing location-based algorithms and use attributes of objects for
improving the matching.
3.1 Framework
First, we present our framework. A dataset is a collection of geographical objects that
are extracted from a given map. Each object represents a single real-world geographical
entity and has a point location. (For an object that has a polygonal shape, we consider
the center of mass of the polygonal shape to be the point location of the object.) The
distance between two objects is the Euclidean distance between their point locations.
We denote by distance(a, b) the distance between two objects a and b.
An object may have, in addition to location, attributes that contain information about
the entity that the object represents. We distinguish between two types of attributes. An
attribute I of objects in a dataset A is unique if every two objects in A have different
values for I, i.e., I is a candidate key. We consider I as non-unique if there can be two
objects in A that have the same value for I. For example, in a dataset of hotels, the name
of a hotel is a unique attribute, since it is unlikely that two hotels in the same vicinity
will have the same name. We consider rating (number of stars) as non-unique, because
two proximate hotels may have the same number of stars. When locations of objects are
not accurate, we can improve a basic matching algorithm by using additional attributes.
If the additional information is correct, a unique attribute can be used for discovering
pairs of corresponding objects that the basic algorithm fails to match. Both unique and
non-unique attributes can be used for detecting pairs of non-corresponding objects that
are, wrongly, deemed corresponding by a matching algorithm.
In integration of maps, locations of objects are not accurate, because the process of
extracting objects and computing their locations, by digitizing an image, introduces er-
rors. Furthermore, maps on the Web may not be accurate to begin with. Thus, given two
datasets A and B that are extracted from two maps, two corresponding objects a ∈ A
and b ∈ B may not have the same location. Yet, for each dataset, errors are normally
distributed with some standard deviation σ. So, for 98.8% of the objects, their distance
from the real-world entity that they represent is less than or equal to 2.5σ. Hence, for
98.8% of the pairs {a, b} of corresponding objects, it holds that distance(a, b) ≤ β,
where β = √((2.5σA)² + (2.5σB)²) is the distance bound of A and B (σA and σB are
the standard deviations of the error distributions in A and B, respectively). In our algo-
rithms, pairs {a, b} with distance(a, b) > β are never deemed corresponding objects.
A matching algorithm receives a pair of datasets A and B and returns two sets P
and S. The set P consists of pairs {a, b}, such that a ∈ A and b ∈ B are likely to be
corresponding objects. The set S consists of singletons {s} (where s ∈ A ∪ B) such that,
with high likelihood, s does not have a corresponding object. Location-based matching
algorithms compute the sets P and S according to the distance between objects.
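To illustrate the framework, the sketch below (our own code, not from the paper) computes the distance bound and shows the shape of the (P, S) output, using a toy one-sided nearest-neighbor matcher that merely stands in for the location-based algorithms of [2,3]; objects are assumed to be dictionaries with a "loc" entry:

import math

def distance(a, b):
    # Euclidean distance between the point locations of two objects.
    (xa, ya), (xb, yb) = a["loc"], b["loc"]
    return math.hypot(xa - xb, ya - yb)

def distance_bound(sigma_a, sigma_b):
    # beta = sqrt((2.5 * sigma_A)^2 + (2.5 * sigma_B)^2)
    return math.hypot(2.5 * sigma_a, 2.5 * sigma_b)

def nearest_neighbor_match(A, B, beta):
    # Toy matcher: pair each object of A with its nearest object of B if that
    # object lies within the distance bound; everything else becomes a singleton.
    P, S, matched_b = [], [], set()
    for a in A:
        b_best = min(B, key=lambda b: distance(a, b), default=None)
        if b_best is not None and distance(a, b_best) <= beta:
            P.append((a, b_best))
            matched_b.add(id(b_best))
        else:
            S.append(a)
    S.extend(b for b in B if id(b) not in matched_b)
    return P, S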
3.2 The New Matching Algorithms
We now describe three new algorithms that receive an existing matching algorithm M
and improve it by using the information provided by some specified attributes. We di-
vide the input to these algorithms into two parts. One part consists of two datasets A and
Pre-D[M,X](A, B)
Parameters: A matching algorithm M, a set of unique attributes X
Input: Datasets A and B
Output: A set P of pairs and a set S of singletons
1: P ← ∅, S ← ∅, A′ ← A, B′ ← B
2: let β be the distance bound of A and B
3: for each a ∈ A′ and b ∈ B′ such that a.x = b.x for some attribute x ∈ X do
4:   if distance(a, b) ≤ β then
5:     P ← P ∪ {{a, b}}
6:     A′ ← A′ − {a}, B′ ← B′ − {b}
7: (P′, S′) ← M(A′, B′)
8: P ← P ∪ P′, S ← S′
9: return (P, S)

Post-R[M,X](A, B)
Parameters: A matching algorithm M, a set of attributes X
Input: Datasets A and B
Output: A set P of pairs and a set S of singletons
1: (P, S) ← M(A, B)
2: for each {a, b} ∈ P such that a.x ≠ b.x for some attribute x ∈ X do
3:   P ← P − {{a, b}}
4: return (P, S)

Pre-F[M,X,φ](A, B)
Parameters: A matching algorithm M, a set of non-unique attributes X, a factor φ
Input: Datasets A and B
Output: A set P of pairs and a set S of singletons
1: P ← ∅, S ← ∅
2: let distance_n(x, y) be a new distance function that, initially, is equal to distance(x, y)
3: for each a ∈ A and b ∈ B such that a.x ≠ b.x for some attribute x ∈ X do
4:   distance_n(a, b) ← φ · distance(a, b)
5: let M_n be the matching algorithm M when run using the distance function distance_n(x, y) instead of the Euclidean distance function distance(x, y)
6: (P, S) ← M_n(A, B)
7: return (P, S)

Fig. 3. The algorithms Pre-process detection, Post-process removal and Pre-process factorizing.
B that should be joined. The second part consists of M, a set X of the given attributes
and, for the third algorithm, an additional factor φ. We denote by P and S the set of
pairs and the set of singletons, respectively, that the algorithms return. The pseudocode
of all three algorithms is presented in Fig. 3.
Pre-process detection (Pre-D)
The Pre-D algorithm uses unique attributes for detecting corresponding objects, and
then it calls another matching algorithm on the remaining objects. The algorithm has
two steps.
1. For each pair of objects a ∈ A and b ∈ B, such that a and b have the same value
for some unique attribute of X and the distance between them does not exceed the
distance bound of A and B, the pair {a, b} is added to P, a is removed from A and
b is removed from B.
2. The matching algorithm M is applied to the remaining objects of A and B. Upon
termination, the pairs of the result are added to P and the singletons to S.
Post-process removal (Post-R)
The Post-R algorithm uses a set of attributes X for detecting pairs of objects that are
erroneously matched by another algorithm. The Post-R algorithm has two steps.
1. The matching algorithm M is applied to A and B. The result is a set P of pairs and
a set S of singletons.
2. For each pair of objects {a, b} in P, such that a and b have different values for some
attribute of X, the pair {a, b} is removed from P.
Pre-process distance factorization (Pre-F)
The Pre-F algorithm uses a set X of non-unique attributes as follows. For every pair of
objects a ∈ A and b ∈ B that have different values for some attribute of X, the distance
between a and b is multiplied by the given factor φ > 1. Note that increasing the
distance between objects lowers the probability that they will be matched by a location-
based algorithm. The algorithm M uses the new distances to join A and B.
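The three wrappers can also be stated compactly in code. The sketch below is our reading of the pseudocode in Fig. 3 (the dictionary-style attribute access, the matcher interface M(A, B) returning (P, S), the optional dist argument of M in pre_f, and the treatment of null values are assumptions made for illustration; distance is the function from the earlier sketch):

def differ(a, b, x):
    # True when both objects carry a value for attribute x and the values disagree.
    return a.get(x) is not None and b.get(x) is not None and a.get(x) != b.get(x)

def pre_d(M, X, A, B, beta):
    # Pre-process detection: match pairs that agree on a unique attribute of X and
    # lie within the distance bound, then run M on the remaining objects.
    P, A2, B2 = [], list(A), list(B)
    for a in list(A2):
        for b in list(B2):
            if any(a.get(x) is not None and a.get(x) == b.get(x) for x in X) \
                    and distance(a, b) <= beta:
                P.append((a, b))
                A2.remove(a)
                B2.remove(b)
                break
    P2, S = M(A2, B2)
    return P + P2, S

def post_r(M, X, A, B):
    # Post-process removal: run M, then drop pairs that disagree on some attribute of X.
    P, S = M(A, B)
    return [(a, b) for (a, b) in P if not any(differ(a, b, x) for x in X)], S

def pre_f(M, X, phi, A, B):
    # Pre-process distance factorization: inflate the distance of pairs that disagree
    # on a non-unique attribute of X, then run M with the modified distance function.
    def factored_distance(a, b):
        d = distance(a, b)
        return phi * d if any(differ(a, b, x) for x in X) else d
    return M(A, B, dist=factored_distance)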
In our experiments, we tested eight different combinations of the above algorithms.
Suppose that the set Y contains the shared attributes of two datasets A and B. Let
unique(Y) and non-unique(Y) be the sets of unique and non-unique attributes of Y,
respectively. Given a location-based matching algorithm M, the following are the eight
possible ways of computing the matching of A and B.
1. Use only the location based algorithm M, i.e., return M(A, B).
2. Use Post-R with M. That is, return Post-R[M,Y](A, B).
3. Use Pre-D with M. That is, return Pre-D[M,unique(Y)](A, B).
4. Combine Pre-D and Post-R, i.e., return Post-R[Pre-D[M,unique(Y)],Y](A, B).
5. Use Pre-F with M. That is, return Pre-F[M,non-unique(Y)](A, B).
6. Combine Post-R with Pre-F, i.e., return Post-R[Pre-F[M,non-unique(Y)],Y](A, B).
7. Combine Pre-D with Pre-F. That is, return Pre-D[Pre-F[M,non-unique(Y)],unique(Y)](A, B).
8. Combine all the three methods by applying Pre-F, Pre-D, M and, finally, Post-R,
i.e., return Post-R[Pre-D[Pre-F[M,non-unique(Y)],unique(Y)],Y](A, B).
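For instance, using the wrappers sketched above, the eighth combination nests as follows (illustration only; here M must accept the optional dist argument used by pre_f):

def combination_8(M, Y, Y_unique, Y_nonunique, phi, A, B, beta):
    # Post-R[Pre-D[Pre-F[M, non-unique(Y)], unique(Y)], Y](A, B)
    def m_pre_f(A2, B2):
        return pre_f(M, Y_nonunique, phi, A2, B2)
    def m_pre_d(A2, B2):
        return pre_d(m_pre_f, Y_unique, A2, B2, beta)
    return post_r(m_pre_d, Y, A, B)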
3.3 Computing the Distance Bound
Applying a matching algorithm requires knowing the distance bound β (or an approxi-
mation of it). The approximation of β is computed based on approximations of σA and
σB, the standard deviations of the error distributions in the integrated datasets (see
Section 3.1). The values σA and σB (we also call them the errors of the datasets) are
sometimes provided with the maps, and in other cases we need to estimate them.
The error of a dataset is caused by errors in the procedure of collecting and process-
ing the geographical data. The procedure is different when generating raster (imagery)
maps and when vector (feature based) maps are produced. (See [11] for more detailed
descriptions of these procedures.)
Raster maps are typically generated from satellite or aerial photographs. There are
three main causes of error in the process of creating raster maps. First, errors are intro-
duced when the photos are orthorectified, i.e., when the photos are corrected to accurately
represent the surface of the earth. Second, the size of the pixels in the photo affects the
error. Currently, a resolution of 70cm per pixel at nadir is common in satellite photos
(e.g., in the two main high-resolution commercial earth-observation satellites IKonos
and QuickBird). The first two factors are relatively small and the main cause of error is
the third factor, which is the accuracy of the geo-referencing process, i.e., the accuracy of
matching earth coordinates to the photo. The accuracy of the geo-referencing depends
on the existence and accuracy of reference points. When no reference points exist, the
accuracy is about 10 meters, while when there are reference points, the accuracy is
about 1–10 meters, depending on the accuracy of the reference points. Extracting features
from the raster image (e.g., identifying the location of a hotel) also introduces an er-
ror which is approximately the number of pixels of the error in the extraction process
multiplied by the resolution.
Vector maps are usually created either by governmental mapping agencies, or by
commercial companies, according to an agreed mapping standard. The standard defines
accuracy requirements that depend on the map scale. Typically, for urban areas, map
scales are between 1/1000–1/10000. Normally, the required accuracy for such scales
is about 0.3–0.4mm. This means that at a scale of 1/1000, the error is about 0.3–0.4
meters. At a scale of 1/10000, the error is approximately 3–4 meters.
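As a worked illustration (with assumed error values, not measurements from the paper): if dataset A comes from a vector map at scale 1/10000, so σA ≈ 4 meters, and dataset B comes from a raster map whose geo-referencing error is about 10 meters, so σB ≈ 10 meters, then β = √((2.5 · 4)² + (2.5 · 10)²) = √(100 + 625) ≈ 26.9 meters, and only pairs of objects within roughly 27 meters of each other are ever considered corresponding.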
3.4 Measuring the Quality of the Result
We use recall and precision to measure the accuracy of a matching algorithm. Remem-
ber that the result of a matching algorithm consists of sets (singletons and pairs). A set
is correct if it is either a pair of corresponding objects or a single object that has no
corresponding object. Given the result of a matching algorithm, the recall is the ratio of
the number of correct sets in the result to the number of all correct sets. For example, a
recall of 0.8 means that 80% of the correct sets appear in the result. The precision is the
ratio of the number of correct sets in the result to the number of sets in the result. For
example, a precision of 0.9 means that 90% of the sets in the result are correct.
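A small sketch of how these two measures can be computed (our illustration; both the result and the ground truth are represented as collections of sets of object identifiers, each set being a pair or a singleton):

def recall_precision(result_sets, correct_sets):
    result = set(map(frozenset, result_sets))
    correct = set(map(frozenset, correct_sets))
    hits = len(result & correct)   # correct sets that appear in the result
    recall = hits / len(correct) if correct else 1.0
    precision = hits / len(result) if result else 1.0
    return recall, precision

# Example: two of the three correct sets are found; one set in the result is wrong.
r, p = recall_precision([{"a1", "b1"}, {"a2"}, {"a3", "b9"}],
                        [{"a1", "b1"}, {"a2"}, {"a3", "b3"}])
# r = 2/3, p = 2/3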
In our experiments, we knew exactly which sets were correct and, hence, were able
to determine the precision and recall. For synthetic data, all the information about the
data was available to us. For real-world data, we determined the correct sets manually,
using all the available information.
4 Experiments
In this section, we describe the results of extensive experiments on both real-world and
synthetically generated data. The goal of our experiments was to compare the eight
combinations, presented in Section 3.2, over data with varying levels of inaccuracy
and incompleteness. We also wanted to determine by how much our methods improve
existing location-based algorithms. For that, we tested the effect of our methods on
the following three location-based algorithms: nearest-neighbor (NN), mutually-nearest
(MUTU) and normalized-weights (NW); see [3] for a description of these algorithms.
4.1 Tests on Real-World Data
Fig.4. Tests on real-world data
We present the results of integrating the maps
of hotels in Soho as described in Section 2. The
Google-Earth map presents 28 hotels and the map
from Yahoo Maps presents 39 hotels and inns. A
total number of 44 hotels and inns appear in these
sources, where 21 hotels appear in both of the
sources while 23 appear in only one source. For
both sources, we used an error (σ) of 100 meters
because identifying the location of a hotel based
on an icon is highly inaccurate.
Figure 4 shows the harmonic mean of the re-
call and precision (HRP) for the three location-
based algorithms (NW, MUTU, NN). Each one of
the three algorithms was tested according to the first four combinations of Section 3.2.
(The other four combinations are not applicable, since the only attribute, hotel name,
is unique.) The third combination, Pre-D, is clearly the best for each of the three al-
gorithms. It is slightly better than the fourth combination, which includes both Pre-D
and Post-R, since the attribute hotel name is not always accurate (e.g., one hotel has
different names in the two sources). For comparison, Figure 4 also shows the result of
matching just according to hotel names. Note that for combinations 2–4, the process
was semi-automatic, since hotel names do not appear in Yahoo Maps.
4.2 Tests on Synthetic Data
In order to test our methods on data with varying levels of accuracy and incompleteness,
we randomly generated synthetic datasets using a two-step process. First, the real-world
entities are generated. The locations of these entities are randomly chosen, according to
a uniform distribution, in a square area. Each entity has one unique attribute U and one
non-unique attribute N with randomly chosen values. The non-unique attribute has five
possible values (as for the number of stars of a hotel). In the second step, the objects
in each dataset are generated. Each object is associated with a distinct entity and its
location is chosen with an error that is normally distributed (relative to the location of
the entity). In each dataset, different objects correspond to distinct entities. For each
object, the attribute U has either the same value as in the corresponding entity, null (for
Fig.5. Results of Test I Fig.6. Results of Test II
incompleteness) or an arbitrary random value (for inaccuracy). We denote by c(U) the
percentage of objects that have a non-null value for U and by a(U) the percentage of
objects that have either the correct value or null. Values are similarly assigned to N.
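A condensed sketch of this generation procedure (our reconstruction for illustration; parameter names and the encoding of wrong values are assumptions, and the minimal distance between entities is not enforced here):

import random

def generate_entities(n, side, n_values=5):
    # Real-world entities with uniform locations in a square of the given side length,
    # a unique attribute U and a non-unique attribute N with five possible values.
    return [{"U": "u" + str(i), "N": random.randrange(n_values),
             "loc": (random.uniform(0, side), random.uniform(0, side))}
            for i in range(n)]

def generate_dataset(entities, n_objects, sigma, c_u, a_u, c_n, a_n):
    # Objects for n_objects distinct entities, with normally distributed location error.
    def attr_value(correct, c, a):
        # P(null) = 1 - c and P(wrong) = 1 - a, so that c is the fraction of non-null
        # values and a the fraction of correct-or-null values (assumes c + a >= 1).
        r = random.random()
        if r < 1 - c:
            return None
        if r < (1 - c) + (1 - a):
            return "wrong-" + str(random.random())
        return correct
    objects = []
    for e in random.sample(entities, n_objects):
        x, y = e["loc"]
        objects.append({"loc": (random.gauss(x, sigma), random.gauss(y, sigma)),
                        "U": attr_value(e["U"], c_u, a_u),
                        "N": attr_value(e["N"], c_n, a_n)})
    return objects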
We present the results of two tests. In Test I, the values of the attributes are either
accurate or missing (i.e., null). In Test II, all the objects have values for U and N, but
some of those values are inaccurate. In both tests, there are 1000 entities in a square
area of 1350 × 1350 meters with a minimal distance of 15 meters between entities.
Each dataset has 750 objects that are randomly chosen for 750 entities using a standard
deviation of σ = 12 meters for the error distribution. In Test I, the attributes in each
dataset have either the correct values or nulls as follows: a(U) = a(N) = 100%,
c(U) = 40% and c(N) = 60%. That is, only 40% of the objects have the correct value
for the unique attribute and only 60% of the objects have the correct value for the non-
unique attribute (if the value is not the correct one, then it is null). In Test II, attributes
always have non-null values but not necessarily the correct ones, i.e., c(U) = c(N) =
100% and a(U) = a(N) = 80%.
In Test I and Test II, we tried the eight combinations of Section 3.2 with each of the
three algorithms. The results, depicted in Fig. 5 and 6, show the harmonic mean of the
recall and precision for the eight combinations involving each algorithm. Each bar is
for the combination identified by the number on that bar. For comparison, we also show
the result obtained by a matching algorithm that only uses the unique attribute (Name).
Test I shows that when information is partial but accurate, the eighth combination
that uses all of the three algorithms (Pre-D, Post-R and Pre-F) is the best. Test II shows
that when information is inaccurate, Post-R is not effective (as was also the case for the
real-world data) and it is better to use just Pre-D and Pre-F (the seventh combination).
Figures 7 and 8 show the performance of the NW method for varying levels of
completeness and accuracy. In Figure 7, the accuracy varies, i.e., a(U) = a(N) =
70%–100%, and the completeness is fixed, i.e., c(U) = c(N) = 100%. In Figure 8,
the completeness varies, i.e., c(U) = c(N) = 40%–100%, and the accuracy is fixed,
i.e., a(U) = a(N) = 100%. In each graph, the serial number refers to the combina-
tion that produced the graph. Note that the results of only 6 methods (1,2,3,5,7,8) are
presented, since the other two are inferior.
The following are our conclusions from the tests.
1. When there is a unique attribute, it is always good to identify pairs and remove
them from the matching algorithm (Method 2).
Fig.7. Results of NW for varying accuracy Fig. 8. Results of NW for varying completeness
2. When there is a non-unique attribute, it is always good to use factorized distance
(Method 5).
3. Although additional information improves the quality of the results, the main factor
that determines the quality is still the location-based algorithm.
4. When the attributes are not accurate, using the additional information before the
matching improves the quality of the result. But using it after the location-based
matching has a negative effect, for the following reason. While there is only a low
probability that two proximate yet non-corresponding objects have the same value
for a unique attribute, there is a considerably higher probability that two correspond-
ing objects have different values for some unique attribute.
The tests show that in all cases using additional attributes before applying a location-
based matching algorithm improves the quality of the results. Applying additional in-
formation at the end yields an improvement only if that information is accurate.
5 Conclusion
Traditionally, integration of geo-spatial data has been done using map conflation [13,
6]. However, map conflation is not efficient, since whole maps are integrated, not just
selected objects. Thus, conflation is not suitable for Web applications or in the context
of mediators [4,12,19,20], where users request answers to specific queries. Integrating
spatial datasets using only geometric or topological properties [2,3,14] or using only
alphanumeric attributes [9,10] does not exploit all the available information, but the two
approaches can be combined using the approach we introduced in this paper.
Other approaches use both spatial and non-spatial attributes (e.g., [7,15,17]). How-
ever, these approaches sometimes remain at the schema level, rather than actually
matching the objects, as in [7], or have a large computation time, as in [15,17].
In this work we showed how data from maps on the Web can be integrated using
location-based algorithms, and how to utilize information additional to location when
such information exists. We presented three new matching algorithms and tested them
on data with varying levels of incompleteness and inaccuracy. Interestingly, our exper-
iments show that when the additional information is accurate it should be used both
before and after the location-based matching process. When the additional information
is not very accurate, the information should be used only prior to the location-based
matching process. Our experiments show that the new algorithms improve the existing
location-based matching algorithms.
References
1. Geographic Markup Language (GML). http://www.opengeospatial.org/standards/gml.
2. C. Beeri, Y. Doytsher, Y. Kanza, E. Safra, and Y. Sagiv. Finding corresponding objects when
integrating several geo-spatial datasets. In ACM-GIS, pages 87–96, 2005.
3. C. Beeri, Y. Kanza, E. Safra, and Y. Sagiv. Object fusion in geographic information systems.
In VLDB, pages 816–827, 2004.
4. O. Boucelma, M. Essid, and Z. Lacroix. A WFS-based mediation system for GIS interoper-
ability. In ACM-GIS, pages 23–28, 2002.
5. T. Bruns and M. Egenhofer. Similarity of spatial scenes. In SDH, pages 31–42, Delft (Nether-
lands), 1996.
6. M. A. Cobb, M. J. Chung, H. Foley, F. E. Petry, and K. B. Shaw. A rule-based approach for
conflation of attributed vector data. GeoInformatica, 2(1):7–33, 1998.
7. T. Devogele, C. Parent, and S. Spaccapietra. On spatial database integration. In IJGIS,
Special Issue on System Integration, 1998.
8. F. T. Fonseca and M. J. Egenhofer. Ontology-driven geographic information systems. In
ACM-GIS, pages 14–19, Kansas City (Missouri, US), 1999.
9. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava.
Approximate string joins in a database (almost) for free. In VLDB, pages 491–500, 2001.
10. L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins in an RDBMS for web
data integration. In Proceedings of the 12th international conference on World Wide Web,
pages 90–101, 2003.
11. J. C. McGlone. Manual of Photogrammetry, Fifth Edition. American Society of Photogram-
metry and Remote Sensing, 2004.
12. Y. Papakonstantinou, S. Abiteboul, and H. Garcia-Molina. Object fusion in mediator sys-
tems. In VLDB, pages 413–424, 1996.
13. A. Saalfeld. Conflation-automated map compilation. IJGIS, 2(3):217–228, 1988.
14. A. Samal, S. Seth, and K. Cueto. A feature based approach to conflation of geospatial
sources. IJGIS, 18(00):1–31, 2004.
15. M. Sester, K. H. Anders, and V. Walter. Linking objects of different spatial data sets by
integration and aggregation. GeoInformatica, 2(4):335–358, 1998.
16. H. Uitermark, P. Van Oosterom, N. Mars, and M. Molenaar. Ontology-based geographic data
set integration. In Proceedings of Workshop on Spatio-Temporal Database Management,
pages 60–79, Edinburgh (Scotland), 1999.
17. V. Walter and D. Fritsch. Matching spatial data sets: a statistical approach. IJGIS, 13(5):445–
473, 1999.
18. J. M. Ware and C. B. Jones. Matching and aligning features in overlayed coverages. In
ACM-GIS, pages 28–33, 1998.
19. G. Wiederhold. Mediators in the architecture of future information systems. Computer,
25(3):38–49, 1992.
20. G. Wiederhold. Mediation to deal with heterogeneous data sources. In Interoperating Geo-
graphic Information Systems, pages 1–16, 1999.