Integrating Data from Maps on the World-Wide Web
Eliyahu Safra1, Yaron Kanza⋆2, Yehoshua Sagiv⋆⋆3, and Yerach Doytsher1
1Department of Transportation and Geo-Information, Technion, Haifa, Israel
{safra, doytsher}@technion.ac.il
2Department of Computer Science, University of Toronto, Toronto, Canada
yaron@cs.toronto.edu
3School of Engineering and Computer Science, The Hebrew University, Jerusalem, Israel
sagiv@cs.huji.ac.il
Abstract. A substantial amount of data about geographical entities is available
on the World-Wide Web, in the form of digital maps. This paper investigates the
integration of such data. A three-step integration process is presented. First, geo-
graphical objects are retrieved from maps on the Web. Secondly, pairs of objects
that represent the same real-world entity, in different maps, are discovered and
the information about them is combined. Finally, selected objects are presented
to the user. The proposed process is efficient, accurate (i.e., the discovery of cor-
responding objects has high recall and precision) and it can be applied to any pair
of digital maps, without requiring the existence of specific attributes. For the step
of discovering corresponding objects, three new algorithms are presented. These
algorithms modify existing methods that use only the locations of geographical
objects, so that information additional to locations will be utilized in the process.
The three algorithms are compared using experiments on datasets with varying
levels of completeness and accuracy. It is shown that when used correctly, ad-
ditional information can improve the accuracy of location-based methods even
when the data is not complete or not entirely accurate.
1 Introduction
Many maps are available on the World-Wide Web, providing information on geograph-
ical entities. The information consists of both spatial and non-spatial properties of the
entities. Examples of spatial properties are location and shape of an entity. Examples
of non-spatial properties are name and address. The goal of integrating two maps is to
enable applications and users to easily access the properties that are available in either
one of those maps. Another reason for integration is that some geographical entities
may appear in only one of the maps. Integration increases the likelihood that for all the
relevant entities, in a specified geographical area, objects that represent these entities
are presented to the user.
An integration of two maps consists of the following three steps: extracting geo-
graphical objects from the maps, discovering pairs of objects that represent the same
real-world entity in different sources (such objects are called corresponding objects)
⋆ This author was supported by an NSERC grant.
⋆⋆ This author was supported by The Israel Science Foundation (Grant 893/05).
and presenting the result to the user. This paper deals mainly with the second step of dis-
covering corresponding objects. We use the term matching algorithm for an algorithm
that discovers corresponding objects in two given datasets of geographical objects.
Methods for integrating data from the Web, and especially matching algorithms,
should be able to cope with the following characteristics of the Web.
– Data on the Web is heterogeneous. This means that the same piece of information
can have different forms in different sources. For example, in different sources, the
name of a geographical entity can have different spellings or can be written in dif-
ferent languages. This makes it difficult for integration methods to use properties,
such as names, for discovering corresponding objects. Another aspect of hetero-
geneity is incompleteness. Some attributes may not be available in some sources or
not specified for some objects.
– Data may change frequently. For example, maps that contain hotels may also in-
clude reviews that are regularly added and updated by people who have stayed
in those hotels. In such cases, the integration should be performed in real time,
i.e., when the user sends her request for information. Otherwise, the integrated data
will not reflect the most recent changes in the sources. Consequently, an integration
method for data on the Web must be efficient, especially if the method is used in a
Web service that handles many requests concurrently.
– Data on the Web can be incorrect or inaccurate. Hence, on one hand, integration
methods should rely mostly on object properties that are relatively accurate. On
the other hand, this justifies using, in Web applications, approximation matching
algorithms, i.e., highly (but not completely) accurate algorithms for discovering
corresponding objects.
For the above reasons, in this paper we consider techniques that start with
location-based matching algorithms and improve them. Relying primarily on locations
has the following three advantages. First, locations are always available for spatial ob-
jects and their degree of accuracy can be determined relatively easily. Hence, location-
based matching algorithms can be applied to objects from any pair of maps. Second,
location-based methods are suitable for integration of heterogeneous data, since it is
easy to compare a pair of locations even when they are stored or measured in different
ways. Third, there exist efficient location-based matching algorithms.
Location-based matching algorithms that are both efficient and effective were pre-
sented in the past [2,3]. These algorithms only use locations for finding corresponding
objects. Yet, in many cases, the accuracy of the integration can be improved significantly
by using attributes of the integrated objects in addition to locations. This is especially
important when dealing with data from the Web, where locations may be inaccurate.
In this paper, we explain how to use properties of integrated objects to increase the
effectiveness of location-based matching algorithms.
The main contributions of this paper are as follows. First, a complete process of
integrating data from maps on the Web is presented. This process is efficient and gen-
eral, in the sense that it can be applied to any pair of maps. Secondly, we show how,
in addition to locations, attributes of the objects can be used in the integration process.
Specifically, we present three new matching algorithms that use locations as well as
additional information. Thirdly, we describe the results of thorough experiments, on
datasets with different levels of accuracy and completeness, showing that additional
information can improve the results of location-based matching algorithms, when that
information is used appropriately.
The structure of the paper is as follows. In Section 2 we present our methods using a
real-world example of integrating maps showing hotels in the Soho area of Manhattan,
New York. We present our three new methods in Section 3. In Section 4, we provide
the results of experiments we conducted on both real-world data and synthetically gen-
erated data. Also, we compare our methods based on the experimental results. Finally,
in Section 5, we discuss related work and conclude.
2 The Integration Process
We start by presenting our approach to integration of data from maps on the Web. We
do that using an example showing integration of information about hotels in the Soho
area of Manhattan, New York. The data sources we used are Google Earth4 and Yahoo
Maps5. Google Earth is a service that provides a raster image of almost any part of the Earth.
On top of the raster image it shows information such as roads, hotels, and restaurants.
In our example we are interested in information about hotels. For hotels, Google Earth
provides their names. The names are links that lead to additional information, e.g., by
following a link the address of the hotel is provided. A result of a search in Google
Earth for hotels in Soho is depicted in Fig. 1.
Yahoo Maps provides road maps for some major cities in the world. As in Google
Earth, maps include tourist information; however, in Yahoo, hotel names are not pre-
sented on the maps. Instead, a hotel is shown using an icon of a yellow square containing
a red circle, in the location of the hotel. The name of the hotel and additional informa-
tion such as the rank (i.e., number of stars) and price are available for one hotel at a
time. Two possible reasons for not writing hotel names on the map are (1) making the
presentation of the map simpler and easier to read (cartographic reasons), and (2) re-
stricting the information released per user request, so that applications will not be
able to retrieve all the data from Yahoo to their local database (commercial reasons). A
result of a search in Yahoo Maps for hotels in Soho is depicted in Fig. 2.
It may seem a good solution to use, in the hotel scenario, a matching algorithm that
considers as corresponding objects pairs of hotels that have the same name. However,
because names of hotels are not presented on maps from Yahoo, a matching based
on names is problematic. Two other difficulties in using hotel names in a matching
algorithm are the uncertainty in deciding whether two names refer to the same hotel
and the presence of errors in the data. In our case, uncertainty is due to the existence
of several hotels with similar names in the area we consider. For instance, consider the
following hotel names: "Grand Hotel", "Soho Grand Hotel" and "Tribeca Grand Hotel".
Are these the names of three different hotels or of only two different hotels? Another
case of uncertainty is when a hotel has more than one name. In the Soho area, the hotel
named "Howard Johnson Express Inn" according to Google Earth, is named "Metro
Three Hotel Llc" in Yahoo Maps, and indeed these are two names of the same hotel.
4 http://earth.google.com
5 http://maps.yahoo.com
Fig. 1. A map from Google Earth. Fig. 2. A map from Yahoo Maps.
In this work we propose the following three-step integration process. (1) Retrieve
the maps, extract relevant objects from the maps and compute the location of the objects.
(2) Apply a matching algorithm for finding pairs of corresponding objects. (3) Display
objects to the user (or return them as a dataset) where each pair of corresponding objects
is represented by a single object. Objects that do not belong to any pair of corresponding
objects may also be presented.
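To make the three steps concrete, here is a minimal Python sketch (our illustration, not code from the paper; extract, matcher and merge are placeholders supplied by the caller):

def integrate(map_a, map_b, extract, matcher, merge, keep=lambda obj: True):
    # Step 1: retrieve each map, extract relevant objects and compute their locations.
    A = extract(map_a)
    B = extract(map_b)
    # Step 2: find pairs of corresponding objects and singletons.
    P, S = matcher(A, B)
    # Step 3: represent each pair by a single combined object, optionally filter, and return.
    merged = [merge(a, b) for (a, b) in P] + list(S)
    return [obj for obj in merged if keep(obj)]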
We now illustrate these steps using the Soho-hotels scenario. Initially, a search for
hotels in Soho, New York was made in both Google Earth and Yahoo Maps, and two
result images were retrieved (the images shown in Fig. 1 and Fig. 2). The two images
that were found in the search were oriented using geo-referencing. Then, geographical
objects were generated by digitizing the maps, that is, by identifying in the raster images
icons of hotels and calculating the locations of the hotels based on the geo-referencing.
In this example scenario, hotel names were inserted by a human user. In the future
we expect many maps on the Web to be in formats that computers can easily process
without the need of human intervention. GML (Geographic Markup Language) [1] is
an example of such a format.
The second step was to apply a matching algorithm to the two datasets that were
extracted from the maps. The result of this step consists of pairs of objects that represent
the same hotel, and of singletons representing hotels that appear in only one of the
sources. More details about the matching algorithm will be given in the next section.
The final step of the integration is displaying to the user the pairs and singletons
produced by the matching algorithm. Before providing the results, conditions can be
used for selecting which objects to display. Note that filtering the results at this step
makes it possible to apply conditions that use attributes from both sources.
3 Matching Algorithms
The most involved part of an integration process is the discovery of corresponding ob-
jects, i.e., the matching algorithm. Several matching algorithms that use only the loca-
tion of objects were proposed in the past [2,3]. We now present three new algorithms
that are built upon existing location-based algorithms and use attributes of objects for
improving the matching.
3.1 Framework
First, we present our framework. A dataset is a collection of geographical objects that
are extracted from a given map. Each object represents a single real-world geographical
entity and has a point location. (For an object that has a polygonal shape, we consider
the center of mass of the polygonal shape to be the point location of the object.) The
distance between two objects is the Euclidean distance between their point locations.
We denote by distance(a, b) the distance between two objects a and b.
An object may have, in addition to location, attributes that contain information about
the entity that the object represents. We distinguish between two types of attributes. An
attribute I of objects in a dataset A is unique if every two objects in A have different
values for I, i.e., I is a candidate key. We consider I as non-unique if there can be two
objects in A that have the same value for I. For example, in a dataset of hotels, the name
of a hotel is a unique attribute, since it is unlikely that two hotels in the same vicinity
will have the same name. We consider rating (number of stars) as non-unique, because
two proximate hotels may have the same number of stars. When locations of objects are
not accurate, we can improve a basic matching algorithm by using additional attributes.
If the additional information is correct, a unique attribute can be used for discovering
pairs of corresponding objects that the basic algorithm fails to match. Both unique and
non-unique attributes can be used for detecting pairs of non-corresponding objects that
are, wrongly, deemed corresponding by a matching algorithm.
In integration of maps, locations of objects are not accurate, because the process of
extracting objects and computing their locations, by digitizing an image, introduces er-
rors. Furthermore, maps on the Web may not be accurate to begin with. Thus, given two
datasets A and B that are extracted from two maps, two corresponding objects a ∈ A
and b ∈ B may not have the same location. Yet, for each dataset, errors are normally
distributed with some standard deviation σ. So, for 98.8% of the objects, their distance
from the real-world entity that they represent is less than or equal to 2.5σ. Hence, for
98.8% of the pairs {a, b} of corresponding objects, it holds that distance(a, b) ≤ β,
where β = √((2.5σA)² + (2.5σB)²) is the distance bound of A and B (σA and σB are
the standard deviations of the error distributions in A and B, respectively). In our algo-
rithms, pairs {a, b} with distance(a, b) > β are never deemed corresponding objects.
A matching algorithm receives a pair of datasets A and B and returns two sets P
and S. The set P consists of pairs {a, b}, such that a ∈ A and b ∈ B are likely to be
corresponding objects. The set S consists of singletons {s} (where s ∈ A ∪ B) such that,
with high likelihood, s does not have a corresponding object. Location-based matching
algorithms compute the sets P and S according to the distance between objects.
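To illustrate the framework, the sketch below (our own code, not from the paper) computes the distance bound and shows the shape of the (P, S) output, using a toy one-sided nearest-neighbor matcher that merely stands in for the location-based algorithms of [2,3]; objects are assumed to be dictionaries with a "loc" entry:

import math

def distance(a, b):
    # Euclidean distance between the point locations of two objects.
    (xa, ya), (xb, yb) = a["loc"], b["loc"]
    return math.hypot(xa - xb, ya - yb)

def distance_bound(sigma_a, sigma_b):
    # beta = sqrt((2.5 * sigma_A)^2 + (2.5 * sigma_B)^2)
    return math.hypot(2.5 * sigma_a, 2.5 * sigma_b)

def nearest_neighbor_match(A, B, beta):
    # Toy matcher: pair each object of A with its nearest object of B if that
    # object lies within the distance bound; everything else becomes a singleton.
    P, S, matched_b = [], [], set()
    for a in A:
        b_best = min(B, key=lambda b: distance(a, b), default=None)
        if b_best is not None and distance(a, b_best) <= beta:
            P.append((a, b_best))
            matched_b.add(id(b_best))
        else:
            S.append(a)
    S.extend(b for b in B if id(b) not in matched_b)
    return P, S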
3.2 The New Matching Algorithms
We now describe three new algorithms that receive an existing matching algorithm M
and improve it by using the information provided by some specified attributes. We di-
vide the input to these algorithms into two parts. One part consists of two datasets A and
Pre-D[M,X](A, B)
Parameters: A matching algorithm M, a set of unique attributes X
Input: Datasets A and B
Output: A set P of pairs and a set S of singletons
1: P ← ∅, S ← ∅, A′ ← A, B′ ← B
2: let β be the distance bound of A and B
3: for each a ∈ A′ and b ∈ B′ such that a.x = b.x for some attribute x ∈ X do
4:   if distance(a, b) ≤ β then
5:     P ← P ∪ {{a, b}}
6:     A′ ← A′ − {a}, B′ ← B′ − {b}
7: (P′, S′) ← M(A′, B′)
8: P ← P ∪ P′, S ← S′
9: return (P, S)

Post-R[M,X](A, B)
Parameters: A matching algorithm M, a set of attributes X
Input: Datasets A and B
Output: A set P of pairs and a set S of singletons
1: (P, S) ← M(A, B)
2: for each {a, b} ∈ P such that a.x ≠ b.x for some attribute x ∈ X do
3:   P ← P − {{a, b}}
4: return (P, S)

Pre-F[M,X,φ](A, B)
Parameters: A matching algorithm M, a set of non-unique attributes X, a factor φ
Input: Datasets A and B
Output: A set P of pairs and a set S of singletons
1: P ← ∅, S ← ∅
2: let distance_n(x, y) be a new distance function that, initially, is equal to distance(x, y)
3: for each a ∈ A and b ∈ B such that a.x ≠ b.x for some attribute x ∈ X do
4:   distance_n(a, b) ← φ · distance(a, b)
5: let M_n be the matching algorithm M when run using the distance function distance_n(x, y) instead of the Euclidean distance function distance(x, y)
6: (P, S) ← M_n(A, B)
7: return (P, S)

Fig. 3. The algorithms Pre-process detection, Post-process removal and Pre-process factorizing.
B that should be joined. The second part consists of M, a set X of the given attributes
and, for the third algorithm, an additional factor φ. We denote by P and S the set of
pairs and the set of singletons, respectively, that the algorithms return. The pseudocode
of all three algorithms is presented in Fig. 3.
Pre-process detection (Pre-D)
The Pre-D algorithm uses unique attributes for detecting corresponding objects, and
then it calls another matching algorithm on the remaining objects. The algorithm has
two steps.
1. For each pair of objects a ∈ A and b ∈ B, such that a and b have the same value
for some unique attribute of X and the distance between them does not exceed the
distance bound of A and B, the pair {a, b} is added to P, a is removed from A and
b is removed from B.
2. The matching algorithm M is applied to the remaining objects of A and B. Upon
termination, the pairs of the result are added to P and the singletons to S.
Post-process removal (Post-R)
The Post-R algorithm uses a set of attributes X for detecting pairs of objects that are
erroneously matched by another algorithm. The Post-R algorithm has two steps.
1. The matching algorithm M is applied to A and B. The result is a set P of pairs and
a set S of singletons.
2. For each pair of objects {a, b} in P, such that a and b have different values for some
attribute of X, the pair {a, b} is removed from P.
Pre-process distance factorization (Pre-F)
The Pre-F algorithm uses a set X of non-unique attributes as follows. For every pair of
objects a ∈ A and b ∈ B that have different values for some attribute of X, the distance
between a and b is multiplied by the given factor φ > 1. Note that increasing the
distance between objects lowers the probability that they will be matched by a location-
based algorithm. The algorithm M uses the new distances to join A and B.
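The three wrappers can also be stated compactly in code. The sketch below is our reading of the pseudocode in Fig. 3 (the dictionary-style attribute access, the matcher interface M(A, B) returning (P, S), the optional dist argument of M in pre_f, and the treatment of null values are assumptions made for illustration; distance is the function from the earlier sketch):

def differ(a, b, x):
    # True when both objects carry a value for attribute x and the values disagree.
    return a.get(x) is not None and b.get(x) is not None and a.get(x) != b.get(x)

def pre_d(M, X, A, B, beta):
    # Pre-process detection: match pairs that agree on a unique attribute of X and
    # lie within the distance bound, then run M on the remaining objects.
    P, A2, B2 = [], list(A), list(B)
    for a in list(A2):
        for b in list(B2):
            if any(a.get(x) is not None and a.get(x) == b.get(x) for x in X) \
                    and distance(a, b) <= beta:
                P.append((a, b))
                A2.remove(a)
                B2.remove(b)
                break
    P2, S = M(A2, B2)
    return P + P2, S

def post_r(M, X, A, B):
    # Post-process removal: run M, then drop pairs that disagree on some attribute of X.
    P, S = M(A, B)
    return [(a, b) for (a, b) in P if not any(differ(a, b, x) for x in X)], S

def pre_f(M, X, phi, A, B):
    # Pre-process distance factorization: inflate the distance of pairs that disagree
    # on a non-unique attribute of X, then run M with the modified distance function.
    def factored_distance(a, b):
        d = distance(a, b)
        return phi * d if any(differ(a, b, x) for x in X) else d
    return M(A, B, dist=factored_distance)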
In our experiments, we tested eight different combinations of the above algorithms.
Suppose that the set Y contains the shared attributes of two datasets A and B. Let
unique(Y) and non-unique(Y) be the sets of unique and non-unique attributes of Y,
respectively. Given a location-based matching algorithm M, the following are the eight
possible ways of computing the matching of A and B.
1. Use only the location based algorithm M, i.e., return M(A, B).
2. Use Post-R with M. That is, return Post-R[M,Y](A, B).
3. Use Pre-D with M. That is, return Pre-D[M,unique(Y)](A, B).
4. Combine Pre-D and Post-R, i.e., return Post-R[Pre-D[M,unique(Y)],Y](A, B).
5. Use Pre-F with M. That is, return Pre-F[M,non-unique(Y)](A, B).
6. Combine Post-R with Pre-F, i.e., return Post-R[Pre-F[M,non-unique(Y)],Y](A, B).
7. Combine Pre-D with Pre-F. That is, return Pre-D[Pre-F[M,non-unique(Y)],unique(Y)](A, B).
8. Combine all the three methods by applying Pre-F, Pre-D, M and, finally, Post-R,
i.e., return Post-R[Pre-D[Pre-F[M,non-unique(Y)],unique(Y)],Y](A, B).
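For instance, using the wrappers sketched above, the eighth combination nests as follows (illustration only; here M must accept the optional dist argument used by pre_f):

def combination_8(M, Y, Y_unique, Y_nonunique, phi, A, B, beta):
    # Post-R[Pre-D[Pre-F[M, non-unique(Y)], unique(Y)], Y](A, B)
    def m_pre_f(A2, B2):
        return pre_f(M, Y_nonunique, phi, A2, B2)
    def m_pre_d(A2, B2):
        return pre_d(m_pre_f, Y_unique, A2, B2, beta)
    return post_r(m_pre_d, Y, A, B)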
3.3 Computing the Distance Bound
Applying a matching algorithm requires knowing the distance bound β (or an approxi-
mation of it). The approximation of β is computed based on approximations of σA and
σB, the standard deviations of the error distributions in the integrated datasets (see
Section 3.1). The values σA and σB (we also call them the errors of the datasets) are
sometimes provided with the maps, and in other cases we need to estimate them.
The error of a dataset is caused by errors in the procedure of collecting and process-
ing the geographical data. The procedure is different when generating raster (imagery)
maps and when vector (feature based) maps are produced. (See [11] for more detailed
descriptions of these procedures.)
Raster maps are typically generated from satellite or aerial photographs. There are
three main causes of error in the process of creating raster maps. First, errors are intro-
duced when the photos are orthorectified, i.e., when the photos are corrected to accurately
represent the surface of the earth. Second, the size of the pixels in the photo affects the
error. Currently, a resolution of 70cm per pixel at nadir is common in satellite photos
(e.g., in the two main high-resolution commercial earth-observation satellites IKonos
and QuickBird). The first two factors are relatively small and the main cause of error is
the third factor, which is the accuracy of the geo-referencing process, i.e., the accuracy of
matching earth coordinates to the photo. The accuracy of the geo-referencing depends
on the existence and accuracy of reference points. When no reference points exist, the
accuracy is about 10 meters, while when there are reference points, the accuracy is
about 1–10 meters, depending on the accuracy of the reference points. Extracting features
from the raster image (e.g., identifying the location of a hotel) also introduces an er-
ror which is approximately the number of pixels of the error in the extraction process
multiplied by the resolution.
Vector maps are usually created either by governmental mapping agencies, or by
commercial companies, according to an agreed mapping standard. The standard defines
accuracy requirements that depend on the map scale. Typically, for urban areas, map
scales are between 1/1000–1/10000. Normally, the required accuracy for such scales
is about 0.3–0.4mm. This means that at a scale of 1/1000, the error is about 0.3–0.4
meters. At a scale of 1/10000, the error is approximately 3–4 meters.
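As a worked illustration (with assumed error values, not measurements from the paper): if dataset A comes from a vector map at scale 1/10000, so σA ≈ 4 meters, and dataset B comes from a raster map whose geo-referencing error is about 10 meters, so σB ≈ 10 meters, then β = √((2.5 · 4)² + (2.5 · 10)²) = √(100 + 625) ≈ 26.9 meters, and only pairs of objects within roughly 27 meters of each other are ever considered corresponding.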
3.4 Measuring the Quality of the Result
We use recall and precision to measure the accuracy of a matching algorithm. Remem-
ber that the result of a matching algorithm consists of sets (singletons and pairs). A set
is correct if it is either a pair of corresponding objects or a single object that has no
corresponding object. Given the result of a matching algorithm, the recall is the ratio of
the number of correct sets in the result to the number of all correct sets. For example, a
recall of 0.8 means that 80% of the correct sets appear in the result. The precision is the
ratio of the number of correct sets in the result to the number of sets in the result. For
example, a precision of 0.9 means that 90% of the sets in the result are correct.
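A small sketch of how these two measures can be computed (our illustration; both the result and the ground truth are represented as collections of sets of object identifiers, each set being a pair or a singleton):

def recall_precision(result_sets, correct_sets):
    result = set(map(frozenset, result_sets))
    correct = set(map(frozenset, correct_sets))
    hits = len(result & correct)   # correct sets that appear in the result
    recall = hits / len(correct) if correct else 1.0
    precision = hits / len(result) if result else 1.0
    return recall, precision

# Example: two of the three correct sets are found; one set in the result is wrong.
r, p = recall_precision([{"a1", "b1"}, {"a2"}, {"a3", "b9"}],
                        [{"a1", "b1"}, {"a2"}, {"a3", "b3"}])
# r = 2/3, p = 2/3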
In our experiments, we knew exactly which sets were correct and, hence, were able
to determine the precision and recall. For synthetic data, all the information about the
data was available to us. For real-world data, we determined the correct sets manually,
using all the available information.
4 Experiments
In this section, we describe the results of extensive experiments on both real-world and
synthetically generated data. The goal of our experiments was to compare the eight
combinations, presented in Section 3.2, over data with varying levels of inaccuracy
and incompleteness. We also wanted to determine by how much our methods improve
existing location-based algorithms. For that, we tested the effect of our methods on
the following three location-based algorithms: nearest-neighbor (NN), mutually-nearest
(MUTU) and normalized-weights (NW); see [3] for a description of these algorithms.
4.1 Tests on Real-World Data
Fig.4. Tests on real-world data
We present the results of integrating the maps
of hotels in Soho as described in Section 2. The
Google-Earth map presents 28 hotels and the map
from Yahoo Maps presents 39 hotels and inns. A
total number of 44 hotels and inns appear in these
sources, where 21 hotels appear in both of the
sources while 23 appear in only one source. For
both sources, we used an error (σ) of 100 meters
because identifying the location of a hotel based
on an icon is highly inaccurate.
Figure 4 shows the harmonic mean of the re-
call and precision (HRP) for the three location-
based algorithms (NW, MUTU, NN). Each one of
the three algorithms was tested according to the first four combinations of Section 3.2.
(The other four combinations are not applicable, since the only attribute, hotel name,
is unique.) The third combination, Pre-D, is clearly the best for each of the three al-
gorithms. It is slightly better than the fourth combination, which includes both Pre-D
and Post-R, since the attribute hotel name is not always accurate (e.g., one hotel has
different names in the two sources). For comparison, Figure 4 also shows the result of
matching just according to hotel names. Note that for combinations 2–4, the process
was semi-automatic, since hotel names do not appear in Yahoo Maps.
4.2 Tests on Synthetic Data
In order to test our methods on data with varying levels of accuracy and incompleteness,
we randomly generated synthetic datasets using a two-step process. First, the real-world
entities are generated. The locations of these entities are randomly chosen, according to
a uniform distribution, in a square area. Each entity has one unique attribute U and one
non-unique attribute N with randomly chosen values. The non-unique attribute has five
possible values (as for the number of stars of a hotel). In the second step, the objects
in each dataset are generated. Each object is associated with a distinct entity and its
location is chosen with an error that is normally distributed (relative to the location of
the entity). In each dataset, different objects correspond to distinct entities. For each
object, the attribute U has either the same value as in the corresponding entity, null (for
Fig.5. Results of Test I Fig.6. Results of Test II
incompleteness) or an arbitrary random value (for inaccuracy). We denote by c(U) the
percentage of objects that have a non-null value for U and by a(U) the percentage of
objects that have either the correct value or null. Values are similarly assigned to N.
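A condensed sketch of this generation procedure (our reconstruction for illustration; parameter names and the encoding of wrong values are assumptions, and the minimal distance between entities is not enforced here):

import random

def generate_entities(n, side, n_values=5):
    # Real-world entities with uniform locations in a square of the given side length,
    # a unique attribute U and a non-unique attribute N with five possible values.
    return [{"U": "u" + str(i), "N": random.randrange(n_values),
             "loc": (random.uniform(0, side), random.uniform(0, side))}
            for i in range(n)]

def generate_dataset(entities, n_objects, sigma, c_u, a_u, c_n, a_n):
    # Objects for n_objects distinct entities, with normally distributed location error.
    def attr_value(correct, c, a):
        # P(null) = 1 - c and P(wrong) = 1 - a, so that c is the fraction of non-null
        # values and a the fraction of correct-or-null values (assumes c + a >= 1).
        r = random.random()
        if r < 1 - c:
            return None
        if r < (1 - c) + (1 - a):
            return "wrong-" + str(random.random())
        return correct
    objects = []
    for e in random.sample(entities, n_objects):
        x, y = e["loc"]
        objects.append({"loc": (random.gauss(x, sigma), random.gauss(y, sigma)),
                        "U": attr_value(e["U"], c_u, a_u),
                        "N": attr_value(e["N"], c_n, a_n)})
    return objects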
We present the results of two tests. In Test I, the values of the attributes are either
accurate or missing (i.e., null). In Test II, all the objects have values for U and N, but
some of those values are inaccurate. In both tests, there are 1000 entities in a square
area of 1350 × 1350 meters with a minimal distance of 15 meters between entities.
Each dataset has 750 objects that are randomly chosen for 750 entities using a standard
deviation of σ = 12 meters for the error distribution. In Test I, the attributes in each
dataset have either the correct values or nulls as follows: a(U) = a(N) = 100%,
c(U) = 40% and c(N) = 60%. That is, only 40% of the objects have the correct value
for the unique attribute and only 60% of the objects have the correct value for the non-
unique attribute (if the value is not the correct one, then it is null). In Test II, attributes
always have non-null values but not necessarily the correct ones, i.e., c(U) = c(N) =
100% and a(U) = a(N) = 80%.
In Test I and Test II, we tried the eight combinations of Section 3.2 with each of the
three algorithms. The results, depicted in Fig. 5 and 6, show the harmonic mean of the
recall and precision for the eight combinations involving each algorithm. Each bar is
for the combination identified by the number on that bar. For comparison, we also show
the result obtained by a matching algorithm that only uses the unique attribute (Name).
Test I shows that when information is partial but accurate, the eighth combination
that uses all of the three algorithms (Pre-D, Post-R and Pre-F) is the best. Test II shows
that when information is inaccurate, Post-R is not effective (as was also the case for the
real-world data) and it is better to use just Pre-D and Pre-F (the seventh combination).
Figures 7 and 8 show the performance of the NW method for varying levels of
completeness and accuracy. In Figure 7, the accuracy varies, i.e., a(U) = a(N) =
70%–100%, and the completeness is fixed, i.e., c(U) = c(N) = 100%. In Figure 8,
the completeness varies, i.e., c(U) = c(N) = 40%–100%, and the accuracy is fixed,
i.e., a(U) = a(N) = 100%. In each graph, the serial number refers to the combina-
tion that produced the graph. Note that the results of only 6 methods (1,2,3,5,7,8) are
presented, since the other two are inferior.
The following are our conclusions from the tests.
1. When there is a unique attribute, it is always good to identify pairs and remove
them from the matching algorithm (Method 2).
Fig.7. Results of NW for varying accuracy Fig. 8. Results of NW for varying completeness
2. When there is a non-unique attribute, it is always good to use factorized distance
(Method 5).
3. Although additional information improves the quality of the results, the main factor
that determines the quality is still the location-based algorithm.
4. When the attributes are not accurate, using the additional information before the
matching improves the quality of the result. But using it after the location-based
matching has a negative effect, for the following reason. While there is only a low
probability that two proximate yet non-corresponding objects have the same value
for a unique attribute, there is a considerably higher probability that two correspond-
ing objects have different values for some unique attribute.
The tests show that in all cases using additional attributes before applying a location-
based matching algorithm improves the quality of the results. Applying additional in-
formation at the end yields an improvement only if that information is accurate.
5 Conclusion
Traditionally, integration of geo-spatial data has been done using map conflation [13,
6]. However, map conflation is not efficient, since whole maps are integrated, not just
selected objects. Thus, conflation is not suitable for Web applications or in the context
of mediators [4,12,19,20], where users request answers to specific queries. Integrating
spatial datasets using only geometric or topological properties [2,3,14] or using only
alphanumeric attributes [9,10] does not exploit all the available information, but the two
approaches can be combined using the approach we introduced in this paper.
Other approaches use both spatial and non-spatial attributes (e.g., [7,15,17]). How-
ever, these approaches sometimes remain at the schema level, rather than actually
matching the objects, as in [7], or have a large computation time, as in [15,17].
In this work we showed how data from maps on the Web can be integrated using
location-based algorithms, and how to utilize information additional to location when
such information exists. We presented three new matching algorithms and tested them
on data with varying levels of incompleteness and inaccuracy. Interestingly, our exper-
iments show that when the additional information is accurate it should be used both
before and after the location-based matching process. When the additional information
is not very accurate, the information should be used only prior to the location-based
matching process. Our experiments show that the new algorithms improve the existing
location-based matching algorithms.
References
1. Geographic Markup Language (GML). http://www.opengeospatial.org/standards/gml.
2. C. Beeri, Y. Doytsher, Y. Kanza, E. Safra, and Y. Sagiv. Finding corresponding objects when
integrating several geo-spatial datasets. In ACM-GIS, pages 87–96, 2005.
3. C. Beeri, Y. Kanza, E. Safra, and Y. Sagiv. Object fusion in geographic information systems.
In VLDB, pages 816–827, 2004.
4. O. Boucelma, M. Essid, and Z. Lacroix. A WFS-based mediation system for GIS interoper-
ability. In ACM-GIS, pages 23–28, 2002.
5. T. Bruns and M. Egenhofer. Similarity of spatial scenes. In SDH, pages 31–42, Delft (Nether-
lands), 1996.
6. M. A. Cobb, M. J. Chung, H. Foley, F. E. Petry, and K. B. Shaw. A rule-based approach for
conflation of attributed vector data. GeoInformatica, 2(1):7–33, 1998.
7. T. Devogele, C. Parent, and S. Spaccapietra. On spatial database integration. In IJGIS,
Special Issue on System Integration, 1998.
8. F. T. Fonseca and M. J. Egenhofer. Ontology-driven geographic information systems. In
ACM-GIS, pages 14–19, Kansas City (Missouri, US), 1999.
9. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava.
Approximate string joins in a database (almost) for free. In VLDB, pages 491–500, 2001.
10. L. Gravano, P. G. Ipeirotis, N. Koudas, and D. Srivastava. Text joins in an RDBMS for web
data integration. In Proceedings of the 12th international conference on World Wide Web,
pages 90–101, 2003.
11. J. C. McGlone. Manual of Photogrammetry, Fifth Edition. American Society of Photogram-
metry and Remote Sensing, 2004.
12. Y. Papakonstantinou, S. Abiteboul, and H. Garcia-Molina. Object fusion in mediator sys-
tems. In VLDB, pages 413–424, 1996.
13. A. Saalfeld. Conflation-automated map compilation. IJGIS, 2(3):217–228, 1988.
14. A. Samal, S. Seth, and K. Cueto. A feature based approach to conflation of geospatial
sources. IJGIS, 18(00):1–31, 2004.
15. M. Sester, K. H. Anders, and V. Walter. Linking objects of different spatial data sets by
integration and aggregation. GeoInformatica, 2(4):335–358, 1998.
16. H. Uitermark, P. Van Oosterom, N. Mars, and M. Molenaar. Ontology-based geographic data
set integration. In Proceedings of Workshop on Spatio-Temporal Database Management,
pages 60–79, Edinburgh (Scotland), 1999.
17. V. Walter and D. Fritsch. Matching spatial data sets: a statistical approach. IJGIS, 13(5):445–
473, 1999.
18. J. M. Ware and C. B. Jones. Matching and aligning features in overlayed coverages. In
ACM-GIS, pages 28–33, 1998.
19. G. Wiederhold. Mediators in the architecture of future information systems. Computer,
25(3):38–49, 1992.
20. G. Wiederhold. Mediation to deal with heterogeneous data sources. In Interoperating Geo-
graphic Information Systems, pages 1–16, 1999.