Spatial Cluster Detection in Spatial Flow Data

April 2016
Geographical Analysis 48(4)

DOI:10.1111/gean.12100

Authors:

Ran Tao

University of South Florida

Jean-Claude F. Thill

University of North Carolina at Charlotte

As a typical form of geographical phenomena, spatial flow events have been widely studied in contexts like migration, daily commuting, and information exchange through telecommunication. Studying the spatial pattern of flow data serves to reveal essential information about the underlying process generating the phenomena. Most methods of global clustering pattern detection and local clusters detection analysis are focused on single-location spatial events or fail to preserve the integrity of spatial flow events. In this research we introduce a new spatial statistical approach of detecting clustering (clusters) of flow data that extends the classical local K-function, while maintaining the integrity of flow data. Through the appropriate measurement of spatial proximity relationships between entire flows, the new method successfully upgrades the classical hot spot detection method to the stage of “hot flow” detection. Several specific aspects of the method are discussed to provide evidence of its robustness and expandability, such as the multiscale issue and relative importance control, using a real data set of vehicle theft and recovery location pairs in Charlotte, NC.

Basic flow model.

(a) Vehicle theft locations in Charlotte. (b) Vehicle recovery locations in Charlotte.

Detected flow clusters using different flow proximity measures. (a), (b), (c) use Flow Distance (equation [1]) with detection scale equal to 0.1 mile, 0.2 mile, and 0.3 mile, respectively. (d), (e), (f) use Flow Dissimilarity (equation [2]) with detection scale equal to 0.03, 0.04, and 0.05, respectively.

Flow clusters with different endpoint emphases. (a) Clusters more focused on theft locations (α = 1.5; β = 0.5). (b) Clusters more focused on recovery locations (α = 0.5; β = 1.5).

Spatial Cluster Detection in Spatial Flow Data

Ran Tao, Jean-Claude Thill

Department of Geography and Earth Sciences and Project Mosaic, University of North Carolina at

Charlotte, Charlotte, NC

Introduction

Spatial ﬂows, also known as interactions between georeferenced places, constitute an enduring

object of research in spatial sciences. A ﬂow event in geography typically consists of two basic

components, namely the spatial one, represented as a vector, and the aspatial component, which

encapsulates the type or value it represents. Common examples include migration ﬂows, daily com-

muting ﬂows, international trade ﬂows, and ﬂows of information exchanged through telecommuni-

cation. In general, there are two types of ﬂow data, namely individual ﬂows and aggregated ﬂows

(Murray et al. 2011). The former pertain to individual activities, for example one person taking the

subway from home to work on a weekday morning. In contrast, the latter represent the movement

or interactions of a group of people or objects, for example a group of elks residing in the northern

section of Yellowstone National Park and migrating to lower altitudes before winter arrives.

Correspondence: Ran Tao, Department of Geography and Earth Sciences and Project Mosaic, University of North Carolina at Charlotte, Charlotte, NC

e-mail: rtao2@uncc.edu

[Correction added on 1 June 2016, after ﬁrst online publication: the publisher apologizes for the wrong

version of this article being inadvertently published due to a technical error. Corrections for clarity

have been made throughout the article in the text, equations and references, without impacting the

results or conclusions of the study].

Submitted: March 05, 2015. Revised version accepted: February 01, 2016.

doi: 10.1111/gean.12100

© 2016 The Ohio State University

Understanding the pattern and dynamics of spatial flows has been a long-standing goal of

spatial scientists. With the fast development in sensor and GPS technologies in recent years, large

volumes of spatiotemporal data have become available with ﬁne granularity. In addition, emerg-

ing types of interactive activities, like information exchange on social media networking, enhance

the richness of ﬂow events. The increased availability of massive volumes of new forms of ﬂow

data inevitably brings unprecedented opportunities to enrich our understanding of patterns and

processes embedded in the geographic space, but this also presents new analytical challenges at

several levels. First, there is the challenge to develop advanced methods to generalize and extract

useful information from massive ﬂow data; next, the challenge to conceive new visualization

approaches to represent ﬂows more effectively; also, to design handy and highly interactive tools

to incorporate ﬂow data into geospatial information systems; and ﬁnally, to build spatial interac-

tion models to understand the nature behind locational choices and their relationships. Among

these endeavors, detecting spatial distribution patterns globally or locally, that is, clustered, scattered, or random, across the spatial extent has garnered a lot of attention. While many contributions have used techniques such as Spatial Data Mining, Geovisualization, and Graph Theory (Tobler 1987; Cui et al. 2008; Guo 2009; Zhu and Guo 2014) to better handle the large data volume, we contend that spatial statistics has not shown its full potential for the detection of spatial distribution patterns of flow data, in spite of the abundance of effective spatial statistics techniques that have been devised to deal with spatial point data, spatial line segment data, and spatial polygon data (e.g., Moran’s I [Moran 1950], Geary’s C [Geary 1954], Getis and Ord’s G [Getis and Ord 1992; Ord and Getis 1995], Ripley’s K-function [Ripley 1976]). Thus, it is the purpose

of this study to develop novel spatial statistical approaches to detect spatial clustering patterns in

ﬂow data with the aim of understanding their spatial relationships, while preserving the integrity

of the ﬂow data. To this end, we introduce new spatial proximity measures tailored for ﬂow data,

on the basis of which we extend the well-known point data analysis method, namely the local

Ripley’s K-function, to the spatial ﬂow context. The new approach is presented and the evidence

of its robustness and efﬁciency is provided via experiments on a real data set.

The rest of this article is organized as follows. In the second section a brief literature

review is provided, which covers previous studies on spatial clustering detection especially

those pertaining to ﬂow data. Then a thorough explanation of our new approach is presented,

including both the theoretical foundations and the technical details. The fourth section consists

of experiments with real data, along with evaluations of the performance of the proposed ana-

lytical method. We conclude with a discussion of the main characteristics and contribution of

our method, as well as proposed future extensions.

Literature review

Given the general tendency of spatial phenomena to co-occur spatially as encapsulated by

Tobler’s First Law of Geography (1970), spatial clustering is one of the most common spatial

patterns of point events. It represents a general tendency of events occurring closer to each

other than one might expect by chance (Waller 2009). An extensive body of literature on cluster detection and monitoring exists that has advanced various methods to identify such patterns.

Several excellent references provide overviews of the concepts and methods involved (e.g.,

Diggle 1983; Cressie 1993; Fortin and Dale 2009; Symanzik 2014).

Early studies were mostly concerned with the overall spatial pattern exhibited by the

events and devised spatial statistics as a single index, sometimes labeled as “global” statistics,

to depict the nature of events and of the spatial process producing a certain spatial distribution

within the entire study area. Well-known examples include Moran’s I, Geary’s C, Quadrat

Analysis, Nearest Neighbor Index, and Ripley’s K-function. However, one of the fundamental

assumptions of these methods, namely the spatial stationarity, is difﬁcult to comply with in

many real situations. Furthermore, a single statistic does not allow further investigation of more detailed patterns and relationships, such as how the spatial process associated with one variable

would be dependent on others (Fotheringham 1997). To cope with such issues, spatial pattern

analysis has shifted toward the development of local statistics for detecting spatial clusters. In

contrast with global spatial clustering methods that are designed to identify whether there exists

a general tendency for events to occur nearer other events than expected by chance, techniques

for localized cluster detection are aimed at ﬁnding anomalies and interesting collections of spa-

tial events within the study area that appear to be inconsistent with the background conceptual

model of how events arise (Besag and Newell 1991; Waller 2009). Notable approaches include

the geographical analysis machine (GAM) (Openshaw et al. 1987) and its derivative methods

(Besag and Newell 1991; Fotheringham and Zhan 1996), the local version of Ripley’s K-

function (Getis and Franklin 1987), local indicators of spatial association (LISA) especially the

local Moran’s I statistic, local Geary’s C (Anselin 1995), and local G statistic (Getis and Ord

1992; Ord and Getis 1995). Some local detection methods around predetermined locations are

called “focused tests” to differentiate them from those based on randomly chosen event loca-

tions (Besag and Newell 1991). The local Cross K-function is such a focused test to identify

clusters of events around speciﬁc locations, such as crime instances around railway stations or

shopping malls (Boots and Okabe 2007). Regardless of the technical details, local cluster

detection methods all hold the advantages that they can better integrate with the fast-

developing GeoComputation technology to handle large data sets and their results can be well

illustrated with the visualization and mapping capabilities of Geographic Information Systems

(GIS) (Fotheringham 1997). Recent contributions come from both the methodological develop-

ment perspective, such as the network-constrained local K-function and local Moran’s I

(Yamada and Thill 2007, 2010), the Multidirectional Optimum Ecotope-Based Algorithm

(AMOEBA) (Aldstadt and Getis 2006), and from the toolset designing perspective, for exam-

ple R, ArcGIS, GeoDa (Anselin, Syabri, and Kho 2006), SaTScan (Kulldorff et al. 1997).

The preponderance of the literature on spatial point pattern analysis treats each point as an

event independent of all the others. Spatial ﬂow data, however, encompass at least two points,

one corresponding to the origin or start of the ﬂow and one for the destination or end of the ﬂow.

Flow data, therefore, differ fundamentally from single point data and methods designed to handle

the latter cannot be directly applied to flow data. Several endeavors have been undertaken in previous research to fill this gap. Berglund and Karlström (1999) applied the $G_i$ statistics introduced

by Getis and Ord (1992) and Ord and Getis (1995) to identify local spatial association in ﬂow

data. Although several different spatial weight matrices were proposed in this article to address

spatial non-stationarity, only the simplest binary spatial weight matrix based on identical origins

or destinations was implemented, which certainly limits its usage. Lu and Thill (2003) proposed

an ad hoc and partially qualitative approach in which they apply point cluster detection methods

to analyze origin and destination points respectively, and combine the two sets of results via a

relationship table to conclude on the patterns exhibited by the ﬂows. Related issues such as sensi-

tivity to scale and neighborhood deﬁnition were discussed in their later work (Lu and Thill 2008).

While decomposing one-dimensional ﬂows into zero-dimensional points can considerably sim-

plify the problem, this approach would inevitably overlook the simultaneity of some critical

information, such as ﬂow direction and ﬂow length. Murray et al. (2011) departed from this

approach by combining exploratory spatial data analysis and conﬁrmatory circular statistics to

analyze the similarities of ﬂow direction and length. However, they sacriﬁce the actual locational

information in the process so that little knowledge on spatial relationships between movements

can be extracted. More recently, Liu, Tong, and Liu (2015) extended both global and local Mor-

an’s I statistics to a ﬂow context, considering movement distances and directions at once. None-

theless, their approach is still based on the spatial proximity relationship of either set of end

points rather than entire vectors. Therefore, we contend that it remains within the scope of meas-

uring spatial autocorrelation of vectors/ﬂows in parts rather than as a whole. The method pro-

posed in this article departs radically from the existing literature by maintaining the integrity of

ﬂow data. It not only fully considers ﬂow characteristics, that is, end points, length, and direction,

but also builds on proper measurement of spatial proximity relationship between entire ﬂows.

While this article mainly focuses on spatial statistical methods, contributions from other

perspectives are also worth considering. Various research contributions apply techniques of

data mining and geovisualization to investigate the properties of spatial ﬂows. Tobler (1987)

suggested that selective information aggregation and removal is an effective strategy for identi-

fying patterns through visualization and he pioneered this idea to analyze migration ﬂows with

computer-drawn maps. Beneﬁting from burgeoning computing capability and visualization per-

formance, many contributions have emerged to be both effective and efﬁcient, especially for

large data sets. K-means algorithms have proved very effective with respect to multilocation

spatial data (Genolini and Falissard 2010; Ossama, Mokhtar, and El-Sharkawi 2011). Density-

based clustering methods have also been adjusted to the nature of ﬂow data by summarizing

the distributions of origins and destinations (Nanni and Pedreschi 2006; Zhu and Guo 2014).

Geometry-based edge bundling is another type of approach that reduces the visual clutter caused by extensive edge crossing in flow maps (Cui et al. 2008). To serve the same purpose, Guo (2009) proposed a visualization framework to partition spatial interactions into their “natural” regions and discover mixing patterns of flow networks. In general, such visual analytical methods embrace the principle of data mining and analytical classification methods designed to group observations into “clusters” based on similarity (Waller 2009); therefore, they are also named “cluster analysis.” Given that this overlap in terminology can be confusing, it is necessary to differentiate these “cluster analysis” methods from the spatial statistical

approaches of cluster detection presented in this article. While we mainly focus on building

innovative spatial statistics here, it is potentially very meaningful to incorporate these methods

of exploratory analysis as a prior step to help propose hypotheses.

Methodology

The principle

In spatial analysis, cluster detection is an approach to second-order analysis that is designed to

examine spatial dependence, or spatial relationships between events (Getis and Franklin 1987).

The ﬁrst step is to choose an appropriate measure of spatial proximity between events, for

which distance is a common choice. Ripley’s K-function, Geographic Analysis Machine, Near-

est Neighbor Index and many other statistical approaches are all distance-based methods. Aside

from the default Euclidean distance, other kinds of distance are also applied in some cases, for

instance the network distance (Yamada and Thill 2007). With spatial flow data, there is no natural means to measure spatial proximity due to the multilocation nature of flow records, and this is arguably the biggest difficulty in analyzing spatial patterns of flow data. In other words, with

appropriately measured spatial proximity, cluster detection on ﬂows boils down to the same

algorithmic processes as for points or polygons. Although various distance measures have been

proposed in data mining studies of trajectory, for example using the Hausdorff distance to

extract clustered line segments of trajectories (Lee, Han, and Whang 2007; Chen et al. 2011),

we argue that these distances are not suitable to measure proximity between ﬂows which have

explicit and meaningful location correspondence. Accordingly, we devise a new proximity

measure called the “Flow Distance” and a variant called the “Flow Dissimilarity.” Then we

extend a well-developed spatial point statistic, namely Ripley’s K-function, to the spatial ﬂow

context based on the newly deﬁned proximity measures. Statistical signiﬁcance is tested by

Monte Carlo simulation against the null hypothesis of spatial randomness. Several aspects such

as the multiscalar relevance, relative importance control, and ﬂow value, are discussed in detail

here to demonstrate that this method is versatile and practical.

Flow model

The first step is to define the study object, namely the spatial flow process. Fig. 1 shows two instances of a spatial process $F$ that starts at location $O$ and ends at location $D$. Basic characteristics of $F$ include length: $l = |\overrightarrow{OD}|$; direction: same as the direction of vector $\overrightarrow{OD}$; type: $T$ (e.g., commuting flow); and value: $W$ (e.g., the number of commuters). This basic model is used to represent spatial flow processes in the rest of the article.
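
As a minimal illustration of this flow model, the record below stores one flow as a Python dataclass. The class name `Flow`, the field names, and the convention of reporting direction as an angle in radians are assumptions made for this sketch, not notation taken from the article.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One flow event with origin O = (x, y) and destination D = (u, v)."""
    x: float
    y: float
    u: float
    v: float
    kind: str = "unspecified"  # the type T, e.g. "commuting"
    value: float = 1.0         # the value W, e.g. number of commuters

    @property
    def length(self) -> float:
        """Length l = |OD|."""
        return math.hypot(self.u - self.x, self.v - self.y)

    @property
    def direction(self) -> float:
        """Angle of vector OD in radians, counterclockwise from the +x axis."""
        return math.atan2(self.v - self.y, self.u - self.x)


# Hypothetical flow from (0, 0) to (3, 4)
f = Flow(x=0.0, y=0.0, u=3.0, v=4.0, kind="commuting", value=12)
print(f.length)  # 5.0
```

Keeping the four coordinates in one immutable record mirrors the idea that a flow is treated as a single object rather than as two unrelated points.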

Flow proximity

As mentioned earlier, deﬁning an appropriate proximity measure is the key to decode spatial

ﬂow patterns. Here we introduce such measures based on which both intrarelationships and

interrelationships of ﬂows can be extracted.

Let us take the simple case of measuring the spatial proximity between flow $F_i$ (with origin point $O_i(x_i, y_i)$ and destination point $D_i(u_i, v_i)$) and flow $F_j$ (from point $O_j(x_j, y_j)$ to point $D_j(u_j, v_j)$) in a two-dimensional space (Fig. 1). Measuring distance between these two spatial flows following the approaches advocated so far in the literature would generally be inadequate because distance between either origin points or destination points cannot fully represent the closeness between flows in their entirety. For instance, when both origins are a short (or long) distance to each other and the same can be said of destinations, we can expect that $F_i$ and $F_j$ are also close (or distant, respectively). However, things become less trivial when the two endpoint pairs show dissimilar spatial closeness, that is, origins are close while destinations are distant, or vice versa. Using categorical descriptions is certainly one way to associate distances among origins and destinations. For instance, both distances being short (or both endpoint pairs belonging to the same region) would correspond to “high” spatial association between flows while only one pair of endpoints being close (or belonging to the same region) would correspond to a “medium” degree of association (Berglund and Karlström 1999; Lu and Thill 2003; Zhu and Guo 2014). While such approaches make sense to some extent, they are very sensitive to the ad hoc description standards and exhibit limited external validity.

Figure 1. Basic flow model.

Unlike approaches treating spatial flows as two separate sets of endpoints, we propose to calculate a flow distance that regards flows as inseparable objects. A flow process $F_i$ with origin point $O_i(x_i, y_i)$ and destination point $D_i(u_i, v_i)$ can be seen as a vector point with four coordinates $F_i(x_i, y_i, u_i, v_i)$ in a four-dimensional space. Derived from the general function of Euclidean distance, we define the Flow Distance between flows $F_i(x_i, y_i, u_i, v_i)$ and $F_j(x_j, y_j, u_j, v_j)$ as:

$$FD_{ij} = \sqrt{\alpha\left[(x_i - x_j)^2 + (y_i - y_j)^2\right] + \beta\left[(u_i - u_j)^2 + (v_i - v_j)^2\right]}$$

or, simplified:

$$FD_{ij} = \sqrt{\alpha\, d_O^2 + \beta\, d_D^2} \qquad (1)$$

where $FD_{ij}$ denotes the distance between these two flows; $d_O$ and $d_D$ are the Euclidean distances between the two origins and the two destinations, respectively; the coefficients $\alpha$ and $\beta$ serve to control the relative importance of either set of endpoints ($\alpha > 0$; $\beta > 0$; $\alpha + \beta = 2$; by default $\alpha = \beta = 1$). Through this definition, both the closeness of origins and of destinations make a contribution to the calculation of the Flow Distance. For example in Fig. 2a, $FD_{12} = \sqrt{2^2 + 2^2} = \sqrt{8}$. The value of Flow Distance becomes larger (or smaller) if both endpoints are moved further from (or closer to) their counterparts at the same time; for example, $FD_{12}$ increases in Fig. 2b while it decreases in Fig. 2c. This corresponds to the general sense that proximities of endpoints are positively correlated to the flow closeness.
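
Equation (1) can be sketched in a few lines of Python. The function name `flow_distance`, the tuple layout `(x, y, u, v)`, and the example coordinates are assumptions of this sketch; the coordinates are chosen only so that both endpoint pairs are 2 units apart, as in the Fig. 2a calculation.

```python
import math


def flow_distance(fi, fj, alpha=1.0, beta=1.0):
    """Flow Distance of equation (1) for two flows given as (x, y, u, v) tuples."""
    if alpha <= 0 or beta <= 0 or not math.isclose(alpha + beta, 2.0):
        raise ValueError("alpha and beta must be positive and sum to 2")
    xi, yi, ui, vi = fi
    xj, yj, uj, vj = fj
    d_o_sq = (xi - xj) ** 2 + (yi - yj) ** 2  # squared distance between origins
    d_d_sq = (ui - uj) ** 2 + (vi - vj) ** 2  # squared distance between destinations
    return math.sqrt(alpha * d_o_sq + beta * d_d_sq)


# Hypothetical parallel flows: origins 2 units apart, destinations 2 units apart
f1 = (0.0, 0.0, 0.0, 5.0)
f2 = (2.0, 0.0, 2.0, 5.0)
print(flow_distance(f1, f2))  # sqrt(2**2 + 2**2) = sqrt(8), about 2.83
```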

More importantly, the distance between origins and the distance between destinations are integrated by the same square root transformation so their variations are captured continuously and consistently, which leads to greater accuracy than qualitative descriptors. For instance, compared with Fig. 2a, flow $F_2$ in Fig. 2d has its origin moved toward $F_1$’s and its destination moved away from $F_1$’s. According to previous methods, whether these two flows in Fig. 2d are as close as they are in Fig. 2a depends entirely on the definition of the endpoints’ contiguity relationship. In other words, if two points are defined as contiguous when their distance is less than or equal to 2, $F_1$ and $F_2$ would have two contiguous endpoint pairs in Fig. 2a but only one in Fig. 2d. As a result, the proximities between $F_1$ and $F_2$ are radically different. In contrast, by our definition of Flow Distance, measuring proximity between two flows is not subject to the definition of each endpoint’s own region or the description of the combined endpoints’ closeness. Instead, we capture the variation of all locations seamlessly and let the flow data decide its own spatial neighbors. Accordingly, the distance between $F_1$ and $F_2$ can be calculated and compared directly, as $FD_{12}$ equals $\sqrt{8}$ in both the Fig. 2a and Fig. 2d scenarios.

Figure 2. Flow Distance Examples.
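
The contrast between a contiguity count and the continuous Flow Distance can be reproduced numerically. The sketch below uses hypothetical coordinates (the actual coordinates of Fig. 2 are not given in the text): in the first pair both endpoint distances equal 2, while in the second pair the origins are 1 unit apart and the destinations are $\sqrt{7}$ units apart, so that the Flow Distance is unchanged. The helper names are assumptions of this example.

```python
import math


def endpoint_distances(fi, fj):
    """Return (d_O, d_D) for two flows given as (x, y, u, v) tuples."""
    d_o = math.hypot(fi[0] - fj[0], fi[1] - fj[1])
    d_d = math.hypot(fi[2] - fj[2], fi[3] - fj[3])
    return d_o, d_d


def contiguous_pairs(fi, fj, threshold=2.0):
    """Categorical view: number of endpoint pairs no farther apart than threshold."""
    return sum(d <= threshold for d in endpoint_distances(fi, fj))


def flow_distance(fi, fj):
    """Equation (1) with the default weights alpha = beta = 1."""
    d_o, d_d = endpoint_distances(fi, fj)
    return math.sqrt(d_o ** 2 + d_d ** 2)


f1 = (0.0, 0.0, 0.0, 5.0)
f2_a = (2.0, 0.0, 2.0, 5.0)             # d_O = 2, d_D = 2
f2_d = (1.0, 0.0, math.sqrt(7.0), 5.0)  # d_O = 1, d_D = sqrt(7), about 2.65

print(contiguous_pairs(f1, f2_a), contiguous_pairs(f1, f2_d))  # 2 1
print(flow_distance(f1, f2_a), flow_distance(f1, f2_d))        # both about 2.83
```

The categorical description changes from two contiguous pairs to one, whereas the Flow Distance stays at $\sqrt{8}$ in both configurations.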

Nevertheless, only using the location information of endpoints may sometimes be inadequate because a flow does not only represent the interaction or movement between two locations, but also indicates how far and in what direction the interaction or movement happens. As shown in Fig. 2e, two flows have exactly the same endpoint distances as in Fig. 2a; therefore the Flow Distances are the same according to equation (1). Regardless of the real data type they represent, it would be controversial to say that the two flows in Fig. 2e are as close as the ones in Fig. 2a given that they are separated much more, relative to their lengths. Controlling for the impact of flow length may be necessary to avoid false positive detection of flow clusters. To this end, we propose an extended version of Flow Distance that involves a rescaling, as provided by equation (2). By dividing by the geometric mean of the two flow lengths, a flow pair with longer average length would be measured as closer, ceteris paribus. Therefore, the distance between the short flows $F_1$ and $F_2$ in Fig. 2e becomes four times as long as the one in Fig. 2a. The rationale behind this adjustment is that under many circumstances it is more difficult or rarer to witness spatial interaction or movement between two distant locations than between close locations. For example, wild animals are more likely to travel to a nearby river than to a distant one to seek water. Incorporating flow length into the measure is one way to adjust the criterion of clustering detection for flows with unequal lengths. Given that the adjustment would impair some of the metric properties of distance, we name the adjusted Flow Distance the Flow Dissimilarity, abbreviated as FDS in the rest of this article. We also choose to use the geometric mean over the arithmetic mean of flow lengths because the former is better able to attenuate the impact of extremely unequal length values. In addition, it avoids the limit case of zero-length flows.

$$FDS_{ij} = \sqrt{\frac{\alpha\left[(x_i - x_j)^2 + (y_i - y_j)^2\right] + \beta\left[(u_i - u_j)^2 + (v_i - v_j)^2\right]}{L_i L_j}}$$

or:

$$FDS_{ij} = \sqrt{\frac{\alpha\, d_O^2 + \beta\, d_D^2}{L_i L_j}} \qquad (2)$$

where $FDS_{ij}$ denotes the Flow Dissimilarity between these two flows; $L_i$ and $L_j$ are the flow lengths; the rest are the same as in equation (1).
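
A small Python sketch of equation (2) follows. The function name `flow_dissimilarity` and the coordinates are assumptions for illustration: both pairs have the same endpoint distances ($d_O = d_D = 2$), but the flows in one pair are four times as long as in the other, which reproduces a fourfold difference in Flow Dissimilarity of the kind described for Fig. 2a and Fig. 2e.

```python
import math


def flow_dissimilarity(fi, fj, alpha=1.0, beta=1.0):
    """Flow Dissimilarity of equation (2) for flows given as (x, y, u, v) tuples."""
    xi, yi, ui, vi = fi
    xj, yj, uj, vj = fj
    length_i = math.hypot(ui - xi, vi - yi)  # L_i
    length_j = math.hypot(uj - xj, vj - yj)  # L_j
    numerator = (alpha * ((xi - xj) ** 2 + (yi - yj) ** 2)
                 + beta * ((ui - uj) ** 2 + (vi - vj) ** 2))
    return math.sqrt(numerator / (length_i * length_j))


# Hypothetical pairs with identical endpoint distances but different flow lengths
long_pair = ((0.0, 0.0, 0.0, 8.0), (2.0, 0.0, 2.0, 8.0))   # both flows have length 8
short_pair = ((0.0, 0.0, 0.0, 2.0), (2.0, 0.0, 2.0, 2.0))  # both flows have length 2

fds_long = flow_dissimilarity(*long_pair)
fds_short = flow_dissimilarity(*short_pair)
print(fds_short / fds_long)  # about 4: the short flows are measured as four times more dissimilar
```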

Although considering flow length in spatial pattern detection can be very useful and sometimes necessary, we are not arguing that this is a better approach in all situations. Instead, we believe that both measures make sense under certain circumstances. Evidence can be found in the literature that flow length was not discussed in some research (Berglund and Karlström 1999; Lu and Thill 2003, 2008; Zhu and Guo 2014), while it was taken into consideration in others (Murray et al. 2011; Liu, Tong, and Liu 2015). In this research experiments have been conducted with both Flow Distance (equation [1]) and Flow Dissimilarity (equation [2]) for comparison, and details are provided in the case study section below.

Besides endpoint locations and flow length, the only remaining spatial element of a flow is its directionality. Although we do not directly measure directionality in equations (1) and (2), its impact is implicitly accounted for. As illustrated in Fig. 2f, to maintain $F_2$ at the same distance from $F_1$, according to our Flow Dissimilarity equation it is sufficient to keep its origin and destination at a constant distance from $F_1$’s two endpoints, that is, to keep its endpoints situated on circles centered on $F_1$’s two endpoints (the dashed rings), for example, $F'_2$. Given this geometric constraint, there are in fact few degrees of freedom in directionality for flows that exhibit a tendency toward clustering. Therefore we argue that it is not necessary to discuss flow direction alone since it is heavily dependent on the endpoint locations and flow length. Our test results have also supported this argument by identifying clusters of similar-direction flows.

Last but not least, the coefficients ($\alpha$, $\beta$) in the distance and dissimilarity functions are designed to offer some flexibility in measuring real flow data. The basic functions by default ($\alpha = \beta = 1$) assign equal importance to the origin location and destination location of each flow. However, the research objectives may lead us to pay closer attention to one set of endpoints over the other. For instance, in a study of settlement of foreign immigrants in New York

City in relation to national origin, socio-spatial patterns and processes would be better

informed if more weight is put on where immigrants choose to reside rather than where they

come from. As another example, the manager of a shopping center would be more interested in

where customers come from so that more targeted and effective advertising strategies can be

designed. The inconsistent spatial scale of ﬂow origins and destinations may be another justiﬁ-

cation to rebalance the relative importance of origins and destinations in the Flow Distance and

Dissimilarity measures. For example, different land uses are known to be spatially distributed

differently across cities; in particular employment sites tend to be more clustered geographi-

cally than residential land uses. Therefore, to avoid a statistical bias, a spatial analysis of com-

muting ﬂows should control for the spatial distribution of potential ﬂow origins and

destinations. With appropriate calibration, the same distance (e.g., 500 meters) would have the

same impact on describing the proximity between two origin locations or between two destina-

tion locations.

By adjusting the values of $\alpha$ and $\beta$, the Flow Distance or Dissimilarity can receive different contributions from origins and destinations. For example, if we assign $\alpha = 1.5$ and $\beta = 0.5$, the Flow Distance or Dissimilarity would be more sensitive to the change of origin locations and the corresponding spatial pattern would put more weight on where flows start. In addition, we impose the restriction $\alpha + \beta = 2$ to ensure that results obtained with different coefficients are comparable. Both coefficients must also be positive to match the reality of flow data sets rather than points.
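
The effect of the coefficients can be checked with the simplified form of equation (1). In the sketch below, the helper name `weighted_flow_distance` and the two hypothetical flow pairs are assumptions: pair P has close origins and distant destinations, pair Q the reverse.

```python
import math


def weighted_flow_distance(d_o, d_d, alpha=1.0, beta=1.0):
    """Simplified equation (1): sqrt(alpha * d_O^2 + beta * d_D^2)."""
    return math.sqrt(alpha * d_o ** 2 + beta * d_d ** 2)


# Pair P: origins close (d_O = 1), destinations far (d_D = 3)
# Pair Q: origins far (d_O = 3), destinations close (d_D = 1)
for alpha, beta in [(1.0, 1.0), (1.5, 0.5), (0.5, 1.5)]:
    p = weighted_flow_distance(1.0, 3.0, alpha, beta)
    q = weighted_flow_distance(3.0, 1.0, alpha, beta)
    print(alpha, beta, round(p, 3), round(q, 3))
# With alpha = beta = 1 the two pairs are equally distant (sqrt(10) each).
# With alpha = 1.5, beta = 0.5, pair P (close origins) becomes the closer pair.
# With alpha = 0.5, beta = 1.5, pair Q (close destinations) becomes the closer pair.
```

Because $\alpha + \beta$ is held at 2, the three settings redistribute weight between origins and destinations without changing the overall scale of the measure.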

Hot spot detection method

Using our Flow Distance (or Flow Dissimilarity) as the spatial proximity measure, it becomes

possible to apply well-developed distance-based methods to detect spatial clusters of ﬂow data.

In this study we choose to adjust the local version of Ripley’s K-function. As a classical clus-

tering detection method, the K-function has been continuously implemented and enhanced

since it was redeﬁned by Ripley in 1976 (Ripley 1976; Okabe, Boots, and Satoh 2007). The

fundamental idea of the K-function is to count the number of events within a certain distance

threshold of randomly selected event locations. This number is then used to calculate K-

Geographical Analysis

function value after dividing by the event density and the analysis is repeated for other distan-

ces within a set interval. To obtain statistical conclusions, the K-function value needs to be

compared with the expected value given by the null hypothesis, for example Complete Spatial

Randomness (CSR). If the observed value is higher than expected, the study events exhibit a

tendency toward clustering; or dispersed, if it is lower. Monte Carlo simulation is a frequently

applied technique to assess statistical signiﬁcance (Openshaw et al. 1987). One of the meaning-

ful extensions of K-functions was introduced by Getis and Franklin (1987), based on second-

order neighborhood analysis of mapped point patterns, which has been known as local K-

function analysis. An extension of the local K-function (equation [4]) is applied in this research

to ﬂow data using the four-dimensional approach introduced above. Instead of counting point

events, flow events are counted within a certain Flow Distance (or Flow Dissimilarity) $r$ of flow $F_i$ to represent the function value:

$$\mathrm{LocK}_i(r) = E(\text{number of other flow events within } r \text{ of flow } i) \qquad (3)$$

where $\mathrm{LocK}_i(r)$ is the local K-function value of flow $F_i$ at scale $r$. The scale $r$, also known as

the detection window radius or threshold distance, has always been a crucial factor in spatial

statistics, especially the K-function, which is even known as “multi-distance cluster analysis”.

In our approach we implement the local K-function at multiple scales as well. By increasing

the magnitude of scale r within a certain range deemed suitable to the process under study, for

example, from 0.1 mile to 1 mile when using Flow Distance or from 0.1 to 1.0 when using

Flow Dissimilarity, it is convenient to detect multiscale clustering patterns at once.
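The multiscale counting step can be sketched as follows. The sketch assumes a precomputed symmetric matrix of flow proximities and uses the raw neighbor count as the statistic, with no density normalization or edge weighting. The function name `local_k_counts` and the small matrix are hypothetical.

```python
def local_k_counts(dist, scales):
    """For each flow i and each scale r, count the other flows within r of flow i.

    dist is a symmetric N x N matrix (list of lists) of flow proximities.
    Returns counts[i][t] for flow i and scale scales[t].
    """
    n = len(dist)
    counts = []
    for i in range(n):
        row = []
        for r in scales:
            row.append(sum(1 for j in range(n) if j != i and dist[i][j] <= r))
        counts.append(row)
    return counts

# Hypothetical proximity matrix for four flows (units follow the chosen measure).
dist = [
    [0.00, 0.05, 0.25, 0.95],
    [0.05, 0.00, 0.22, 0.85],
    [0.25, 0.22, 0.00, 0.75],
    [0.95, 0.85, 0.75, 0.00],
]
scales = [0.1 * t for t in range(1, 11)]  # ten scales, r_t = r_1 * t with r_1 = 0.1
counts = local_k_counts(dist, scales)
```

Each row of `counts` is non-decreasing in the scale, so a single pass over the proximity matrix gives the statistic at every detection window at once.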

As with other spatial statistical methods, statistical inference is an important part of reaching any conclusion. Given the nature of flow data, a normal approximation is not an appropriate basis for inference (Lu and Thill 2003, 2008; Liu, Tong, and Liu 2015). Random permutations with Monte Carlo simulation can better serve this purpose. In a two-dimensional space, there is normally more than one way to simulate a set of flows. On the one hand, we can proceed

by setting the location of two endpoints for each simulated ﬂow. Alternatively, we could use

observed ﬂows as objects and move or rotate them in the study area according to some random-

ization procedure. Whatever the technique used, the theory or basic assumptions behind the

simulation must be fully spelled out.

The simplest way is to simulate two sets of points randomly and independently based on a Poisson distribution, and then pair and connect them as flows. However, the customary null hypothesis for point data, that is, CSR, may not be the best option for flows. A more sensible way is conditional spatial randomness, which has been used widely for computing the pseudo P-value in spatial statistics (Anselin 1995). In terms of flow data, the “condition” should be considered when the endpoints are restricted to the distribution of an at-risk population. For instance, commuting flows can be simulated according to the residence and workplace distributions (Lu and Thill 2003), and car accident points can be simulated on the road network and adjusted by annual average daily traffic (Yamada and Thill 2010). In addition to endpoint locations, the distribution of flow length and flow direction can also be conditional. Liu, Tong, and Liu (2015) simulate a set of flows by moving one flow to another randomly selected flow’s endpoint location, so that only the flows’ locations change while their lengths and directions are kept the same. They also propose a second approach that randomly pairs two points, one from the observed origins and the other from the observed destinations, to form simulated flows. In contrast to the first approach, this one keeps endpoint locations the same but reshuffles the lengths and directions. In sum, there is no unique way to simulate spatial flows for significance testing. The appropriate assumption depends on the data (e.g., restriction to an at-risk population). In addition, it is up to the analyst to choose which aspect to examine (e.g., to examine the contribution of flow location to the general flow clustering pattern by only randomizing location while fixing direction and length). Fundamentally, cluster detection is an exploratory analysis. The clusters identified can reflect the respective underlying geographical processes and can also help us contemplate unknown ruling attributes contributing to the spatial pattern. The detailed algorithm is presented step by step as follows.

Algorithm implementation

1. Calculate flow proximity.
   a. Prepare flow events as vectors with the coordinates of origin and destination points. For example, flow $F_i$ with origin $O_i(x_i, y_i)$ and destination $D_i(u_i, v_i)$ is formatted as $F_i(x_i, y_i, u_i, v_i)$.
   b. Apply equation (1) or (2) to calculate the Flow Distance or Flow Dissimilarity between every two flows. Thus an N by N distance matrix is computed for subsequent use.
2. Calculate clustering detection statistics.
   Calculate the local K-function using equation (3) for all the flow events using a series of scales $r_t$ ($t = 1, 2, \ldots, 10$; $r_t = r_1 \times t$). The unit of $r_1$ depends on the proximity equation used in the previous step, for example, $r_1 = 0.1$ mile with equation (1) and $r_1 = 0.1$ with equation (2).
3. Evaluate statistical significance.
   a. Randomly simulate a set of N flows in the study area.
   b. Calculate the local K-function value for each simulated flow, as in steps (1) and (2).
   c. Repeat the previous two steps 1,000 times.
   d. Sort the results of the 1,000 simulations for each flow at each scale. Set the smallest and largest values as the lower and upper envelopes (0.1% significance level).
   e. Compare the observed result with the corresponding significance envelopes. If the observed value exceeds the upper envelope or falls below the lower envelope, the observed pattern is said to be clustered or dispersed, respectively.
4. Visualize and discuss the results.
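Steps 1 to 3 can be sketched end to end at a single scale as shown below. This is a minimal illustration under stated assumptions, not the authors' C/C++ implementation: flow proximity is taken as the unweighted Euclidean distance in (x, y, u, v) space, the null model places both endpoints uniformly at random in a rectangle, and the envelope for flow i is read as the minimum and maximum of the i-th simulated flow's statistic across simulations. All function names and the sample data are hypothetical, and 99 simulations are used instead of 1,000 only to keep the example fast.

```python
import math
import random

def flow_distance(f, g):
    """Assumed unweighted form: Euclidean distance in (x, y, u, v) space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))

def local_k(flows, r):
    """Number of other flows within proximity r of each flow."""
    n = len(flows)
    return [
        sum(1 for j in range(n)
            if j != i and flow_distance(flows[i], flows[j]) <= r)
        for i in range(n)
    ]

def simulate_flows(n, bounds, rng):
    """Assumed null model: endpoints uniform in the rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = bounds
    return [
        (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax),
         rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
        for _ in range(n)
    ]

def classify_flows(flows, r, bounds, n_sim=99, seed=0):
    """Label each observed flow by comparing it with its simulation envelope."""
    rng = random.Random(seed)
    n = len(flows)
    observed = local_k(flows, r)
    lower = [math.inf] * n
    upper = [-math.inf] * n
    for _ in range(n_sim):
        simulated = local_k(simulate_flows(n, bounds, rng), r)
        for i, value in enumerate(simulated):
            lower[i] = min(lower[i], value)
            upper[i] = max(upper[i], value)
    labels = []
    for i in range(n):
        if observed[i] > upper[i]:
            labels.append("clustered")
        elif observed[i] < lower[i]:
            labels.append("dispersed")
        else:
            labels.append("not significant")
    return labels

# Hypothetical data: ten tightly bundled flows followed by ten isolated ones.
bundled = [(2 + 0.01 * k, 2.0, 8.0, 8 + 0.01 * k) for k in range(10)]
isolated = [(float(k), 0.5, 0.5, float(k)) for k in range(10)]
labels = classify_flows(bundled + isolated, r=0.5, bounds=(0, 0, 10, 10))
```

In this toy setting the bundled flows exceed their upper envelopes and are labelled as clustered, while the isolated flows stay within theirs. A network-constrained or at-risk-population null model would replace `simulate_flows` without changing the rest of the procedure.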

Experimental study

Data description

In this study, we test the new ﬂow K-Function method and its algorithmic implementation

using a data set of vehicle theft and recovery location pairs in Charlotte, North Carolina. Given

the determinate relationship and chronological order of the data, the locations where theft hap-

pened and the places where the vehicles were recovered can be regarded as ﬂow origins and

destinations, respectively. According to the crime report released by the Charlotte-

Mecklenburg Police Department (CMPD), there were 14,064 vehicle theft cases within the city

from 09/01/2008 to 08/31/2014. Of all these cases, 6,960 have correct corresponding recovery

locations somewhere else in the city. In the data cleaning process, we removed the records
with identical theft and recovery locations to exclude cases of attempted break-ins, damage
to the vehicle, interrupted stealing, or other incomplete theft crimes. The final study data set

consists of 6,810 theft-recovery ﬂow events. From the map shown as Fig. 3 we can observe the

distribution of these locations. Overall, both theft and recovery locations have similar distribu-

tion across the city: there is a concentration around the city center, except for the southern por-

tion, which is known to encompass more afﬂuent neighborhoods.

To gain a more intuitive knowledge of the data we also estimated the kernel density

(KDE) for both sets of locations with a cell size of 400 square feet and bandwidth of 0.5 mile

(Fig. 4). The KDE maps indicate that many car thefts happened in the eastern and northern

areas near the city center, while a signiﬁcant part of them were recovered in the northwestern

region, where Charlotte Douglas International Airport is located. However, based on point pat-

tern analysis only, we can hardly build connections between theft locations and corresponding

recovery locations. According to popular criminological theories of vehicle theft crimes, such as rational choice theory and routine activity theory, most offenders carefully plan their target and destination places in advance based on their cost-benefit analyses (Lu 2006). As the new trend indicates, more vehicles are stolen by criminal gangs for

money-making business rather than joy-riding (McGoey 2000). Thus it would be extremely

useful to discover the spatial patterns of how stolen vehicles are transported from their offense

place to their destination.

Figure 3. (a) Vehicle theft locations in Charlotte. (b) Vehicle recovery locations in Charlotte.

Following the complete algorithm given in the previous section, we implement our flow clustering detection approach on these crime data step by step. The null hypothesis of flow distribution is that car thefts and recoveries can happen anywhere on the street network within the Charlotte city limits. Therefore, each of the 1,000 Monte Carlo simulations proceeds by randomly locating flows’ endpoints on the city’s street network. We chose this assumption because we have little prior knowledge about motor vehicle theft crime that would justify further restrictions on the distribution of car theft and recovery event locations, or on the flow lengths and directions. Not imposing constraints on the spatial characteristics of flows in the simulation process has the advantage of not excluding any possible contributions to the final cluster results. Edge effects are corrected by reducing the analysis area by a distance equal to the largest distance band used in the analysis (one mile in this case study). Only the flows with both endpoints within this shrunken area are selected as input to the algorithm, while the background flow spatial process and the simulated flows remain within the original area. The implementation is written in C/C++, and the OpenMP parallel computing technique is applied to accelerate computation, especially the simulation part. Results are visualized with ArcMap 10.1 and jFlowMap (Boyandin, Bertini, and Lalanne 2010).
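The edge correction described above amounts to a buffer filter on the flows that are tested. The sketch below assumes a rectangular study area for simplicity, whereas the actual study area is the irregular Charlotte city limits. The names `inside_shrunk_area` and `flows_to_test` and the sample flows are hypothetical.

```python
def inside_shrunk_area(flow, bounds, buffer):
    """True if both endpoints lie at least `buffer` inside the rectangle."""
    x, y, u, v = flow
    xmin, ymin, xmax, ymax = bounds

    def point_ok(px, py):
        return (xmin + buffer <= px <= xmax - buffer
                and ymin + buffer <= py <= ymax - buffer)

    return point_ok(x, y) and point_ok(u, v)

def flows_to_test(flows, bounds, buffer):
    """Indices of flows eligible for testing; all flows still count as neighbors."""
    return [i for i, f in enumerate(flows) if inside_shrunk_area(f, bounds, buffer)]

# Hypothetical 10 x 10 mile study area with a one-mile buffer.
flows = [
    (2.0, 2.0, 8.0, 8.0),   # both endpoints well inside
    (0.5, 5.0, 5.0, 5.0),   # origin too close to the western edge
    (5.0, 5.0, 9.5, 5.0),   # destination too close to the eastern edge
    (1.0, 1.0, 9.0, 9.0),   # both endpoints exactly on the buffer line
]
eligible = flows_to_test(flows, bounds=(0, 0, 10, 10), buffer=1.0)
```

Only the eligible flows would have their local K-function value tested, while the excluded flows remain in the data as potential neighbors, which mirrors keeping the background process within the original area.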

Results and discussion

Fig. 5 shows the local flow clusters detected with our method at selected scales.¹ The flows on the maps represent the local clusters detected by our new approach as significant at the 0.1% level. Each flow has one end colored in red to denote the theft location and the other end in green to show the recovery location. To avoid visual clutter, we aggregate nearby flow clusters into the census block groups where their end points are situated.

The results are analyzed from two aspects. First, we compare the results obtained using the same equation of flow proximity measure. The first three results use Flow Distance with scales of different magnitudes, that is, 0.1, 0.2, and 0.3 of a mile. As the magnitude of the scale increases, more flows are detected as local clusters. The same pattern can be found in the other set of results using Flow Dissimilarity. The variation caused by scale magnitude is consistent with the basic feature of the K-function that the spatial pattern is partly dependent upon the size of the detection window. The increasing number of local flow clusters indicates that more nearby flows are included to contribute to the local K-function value as the detection window becomes larger. At the same time, the increase of scale does not have an equivalent impact on the background distribution which represents our null hypothesis. This is because we simulate the background distribution by randomly placing the flow events on the street network without further specific control, for example, crime risk; therefore the simulated flows are distributed more sparsely throughout the city. As a result, the increase of scale has a positive impact on the number of local flow clusters that are detected. As in other K-function related research, choosing the optimal magnitude of scale remains an open question. It is typically selected according to how well the results help answer context-dependent research questions. In this case, Fig. 5f presents some interesting patterns about vehicle theft and recovery flows. Vehicles stolen from the area in the Southwestern section of the city are usually found somewhere far away and their transport directions vary considerably. In addition, there is another group of clusters in the Southeast showing much shorter transport distances and with similar directions toward the North. One possible reason is that for the vehicles stolen in the Southwest area there are only a few “favorable” places nearby for criminals to dispose of them. Therefore these cars are transported over a long distance to places like chop shops for selling or to places like the airport. Routine criminals who steal from the Southeast area may find it much easier because there are sites nearby in the North to dispose of the cars.

Figure 4. (a) Kernel density estimation of theft locations. (b) Kernel density estimation of recovery locations.

Figure 5. Detected flow clusters using different flow proximity measures. (a), (b), (c) use Flow Distance (equation [1]) with detection scale equal to 0.1 mile, 0.2 mile, and 0.3 mile, respectively. (d), (e), (f) use Flow Dissimilarity (equation [2]) with detection scale equal to 0.03, 0.04, and 0.05, respectively.

On the other hand, we can also compare the results using different types of ﬂow proximity

measures, namely the Flow Distance and Flow Dissimilarity. Comparing the two series of

maps in the top and bottom parts of Fig. 5 for a similar number of local clusters, the most

obvious difference is the average length of clustered ﬂows. The results using Flow Distance

contain many short ﬂows, while the results using Flow Dissimilarity tend to indicate longer

ﬂows as local clusters. Taking a closer look, we ﬁnd that some ﬂows—especially shorter

ones—within the same cluster identiﬁed using Flow Distance do not share many geographic

and geometric similarities with their neighboring ﬂows, for example, quite different ﬂow direc-

tions and ﬂow lengths. In contrast, ﬂows within the same cluster using Flow Dissimilarity tend

to be very similar to each other. The reason behind this difference is that, when ﬂow length is

not considered in measuring ﬂow proximity, short ﬂows need not be as similar in endpoint

locations, length and direction to each other as longer ones to have the same ﬂow distance.

Therefore, they are more readily detected as the locus of a signiﬁcant cluster than long ones, all

other things being equal. This results in false positive detections, since some flows are detected as

local clusters simply because they are short enough to be captured by the detection window.

On the contrary, local clusters identiﬁed with Flow Dissimilarity include ﬂows with close

vehicle theft sites, close vehicle recovery sites, and similar movement directionality and distan-

ces. The pattern is consistent throughout the study region. Moreover, the results would be of

practical use to law enforcement agencies to detect routine gang-related crimes with locational

preference for stealing and selling/disposing of vehicles in the city. In conclusion, we argue

that the algorithm using Flow Dissimilarity to measure ﬂow proximity is less likely to lead to

false positive errors as it controls for one source of spurious cluster detection. Besides, it pro-

vides a meaningful alternative to the traditional distance scale in solving the instability or

inequality in cross-scale ﬂow clustering detection.
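A small numerical example may make the length effect concrete. The sketch assumes, purely for illustration, that Flow Dissimilarity is the Flow Distance divided by the geometric mean of the two flow lengths; equation (2) is the authoritative definition and may differ in detail. The flows, thresholds, and function names are hypothetical.

```python
import math

def length(f):
    """Euclidean length of a flow (x, y, u, v)."""
    return math.hypot(f[2] - f[0], f[3] - f[1])

def flow_distance(f, g):
    """Assumed unweighted form: Euclidean distance in (x, y, u, v) space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))

def flow_dissimilarity(f, g):
    """Assumed form: Flow Distance scaled by the geometric mean of the lengths."""
    return flow_distance(f, g) / math.sqrt(length(f) * length(g))

# Endpoints are displaced by the same 0.1 units in both pairs.
short_a = (0.0, 0.0, 0.2, 0.0)
short_b = (0.0, 0.1, 0.2, -0.1)   # about 45 degrees off short_a
long_a = (0.0, 0.0, 5.0, 0.0)
long_b = (0.0, 0.1, 5.0, -0.1)    # nearly parallel to long_a

fd_short = flow_distance(short_a, short_b)
fd_long = flow_distance(long_a, long_b)
fds_short = flow_dissimilarity(short_a, short_b)
fds_long = flow_dissimilarity(long_a, long_b)
```

Both pairs have the same Flow Distance (about 0.14), so a distance window of 0.2 would treat each pair as neighbors even though the short flows point in clearly different directions. Under the assumed length scaling, the short pair scores about 0.59 and the long pair about 0.03, so a dissimilarity window of 0.05 keeps only the long, nearly parallel pair.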

So far we have only discussed experiments with the basic version of the flow proximity measures. Further usefulness of the measures can be explored by changing their parameter values. In both equations (1) and (2), we specify two coefficients, that is, α and β, to control the relative importance of origins and destinations. The expectation is that changing the relative value of these coefficients can purposely create a tendency for alternative cluster detection results. To test this hypothesis, we adjust our approach by changing the coefficient values in Flow Distance. We assign α = 1.5 and β = 0.5 for the first group and α = 0.5 and β = 1.5 for the second. The sum of the coefficient values is held at 2, for the sake of the comparability of the results.

Fig. 6 includes two comparable result maps. Fig. 6a shows the clusters detected with α = 1.5 and β = 0.5, while Fig. 6b shows the outcomes with α = 0.5 and β = 1.5, both using the Flow Dissimilarity measure with a scale equal to 0.04. Comparing these two maps and also comparing them with Fig. 5d for which α = β = 1 by default, we

ﬁnd that Fig. 6a contains more unique clusters with very close theft locations (red end) but rela-

tively distant recovery locations (green end), while Fig. 6b tends to show the opposite pattern.


In other words, flows with close theft locations are more easily detected as clusters in Fig. 6a and

ﬂows with close recovery locations are favored in Fig. 6b. These observations are in line with

our premise that changing the value of Flow Distance coefﬁcients can lead to results with dif-

ferent emphases, which can cater to people with different interests. In terms of practical useful-

ness, citizens would be more interested in looking at Fig. 6a which can inform where vehicle-

theft crimes are more likely to happen so that they can avoid parking in these highly risky pla-

ces. On the contrary, police would ﬁnd Fig. 6b more useful in order to know where the concen-

trations of car-disposal places are and where they should search for the lost vehicles. By

comparing the result maps with Google Maps we found that the neighborhoods surrounding the

main campus of UNC Charlotte correspond to the cluster of theft sites in the northeastern part

of Fig. 6a, which indicates that this area is a popular car theft locus. Some clusters of recovery

places near the city center in Fig. 6b match the locations of salvage vehicle yards or chop shops,

where stolen cars can be quickly transacted with cash and be sold again in parts.

Figure 6. Flow clusters with different endpoint emphases. (a) Clusters more focused on theft locations (α = 1.5; β = 0.5). (b) Clusters more focused on recovery locations (α = 0.5; β = 1.5).

Conclusions

Spatial statistical approaches to clustering detection have been continuously developed for decades. In contrast with abundant methods designed for point and polygon data, approaches well suited to handling spatial flow data have not been well developed so far. To fill this gap and also to meet the challenges brought by the emerging breadth of massive flow data, this research has developed an innovative spatial statistical method for flows. A pair of particular spatial proximity measures called the Flow Distance and Flow Dissimilarity have been designed. Based on these measures the local version of the K-function is adjusted and implemented to examine the second-order effects of spatial flows. By comparing the observed local K-function value with the statistical confidence envelopes generated via Monte Carlo simulation, the local clustering pattern of each flow event can be identified at a certain statistical significance level. The new method is an intuitive extension of the principles embedded in the K-function for one-dimensional point events and is applicable to all types of flow data.

To test the effectiveness and usefulness of our method, a series of experiments have been

implemented using a real data set of vehicle theft-recovery ﬂows in Charlotte, NC. The results

demonstrate that our method is capable of identifying local clusters from the several thousands of

tangled ﬂows. Speciﬁcally, the measures we designed proved not only to be measures of spatial

proximity, but an effective solution for the inclusion of the multilocation interaction objects

within the scope of well-developed point pattern spatial statistics, namely the local K-function.

By adjusting the parameters of endpoint coordinate pairs, the study emphasis can be purposely

placed on the spatial associations between either ﬂow origins or ﬂow destinations. In addition,

the impact of ﬂow length has also been thoroughly discussed. To overcome the statistical bias

brought by ﬂow lengths, we introduced a variant of Flow Distance called Flow Dissimilarity.

The experiment shows that the algorithm using Flow Dissimilarity leads to more stable spatial

patterns and is adaptive to ﬂows with varied lengths across the study region. Overall, the method

designed in this research has fully utilized the spatial characteristics of ﬂow data, and it is demon-

strated to be capable of investigating spatial associations of ﬂow events across scales. The results

examined with this method have practical implications as well. In this vehicle-theft crime exam-

ple, it can inform not only where frequent car theft and recovery happen, but how the stolen cars

are moved from one place to another in the form of spatial ﬂow clusters. The results are espe-

cially useful to devise effective police responses to routine gang crime activities.

The proposed analytic method can be extended in several ways. First, further work can be

done to expand the capability of this method to include additional event characteristics, for

example considering ﬂow type and value in “hot ﬂow” detection. A plausible idea is to use the

local cross K-function (Boots and Okabe 2007) instead of the traditional local K-function to

detect clusters of ﬂows with different types, for example, rescue goods ﬂow spatially associated

with refugee ﬂow; and to accumulate the total value of nearby ﬂows instead of simply tallying

their frequency in calculating the local K-function so as to adjust the contribution of ﬂows with

unequal value, for example, a one-thousand-people commuting ﬂow versus a single-person

commuting ﬂow. Also, we believe that the Flow Distance and Flow Dissimilarity measures can

be shown to be effective with other methods of exploratory spatial data analysis including the

local Moran’s I and G statistics for ﬂow data analysis. Furthermore, we envision that the princi-

ples of the ﬂow proximity measure can be further expanded to higher dimensionality for the

space-time analysis of ﬂow data, or to other kinds of spatial analyses, for example spatial inter-

action modeling and trajectory data analysis. Lastly, combining this spatial statistical method

with other fast-developing techniques is also very meaningful. GeoComputation, GeoVisuali-

zation, and spatial data mining are all powerful methods that complement conﬁrmatory statisti-

cal analysis, especially in this “Big Data” era.

Note

1 The observed global K-function for this dataset is above the 0.01 upper envelope at most scales. To bet-

ter demonstrate the capability of our new local ﬂow clustering statistics, we report results for selected

scales within the range of statistical signiﬁcance.


References

Aldstadt, J., and A. Getis. (2006). “Using AMOEBA to Create a Spatial Weights Matrix and Identify Spa-

tial Clusters.” Geographical Analysis 38(4), 327–43.

Anselin, L. (1995). “Local Indicators of Spatial Association–LISA.” Geographical Analysis 27(2),

93–115.

Anselin, L., I. Syabri, and Y. Kho. (2006). “GeoDa: An Introduction to Spatial Data Analysis.” Geographical Analysis 38(1), 5–22.

Berglund, S., and A. Karlström. (1999). “Identifying Local Spatial Association in Flow Data.” Journal of Geographical Systems 1(3), 219–36.

Besag, J., and J. Newell. (1991). “The Detection of Clusters in Rare Diseases.” Journal of the Royal Sta-

tistical Society Series A 154(1), 143–55.

Boots, B., and A. Okabe. (2007). “Local Statistical Spatial Analysis: Inventory and Prospect.” International Journal of Geographical Information Science 21(4), 355–75.

Boyandin, I., E. Bertini, and D. Lalanne. (2010). “Using Flow Maps to Explore Migrations over Time.” In Geospatial Visual Analytics Workshop in Conjunction with The 13th AGILE International Conference on Geographic Information Science. Guimarães, Portugal, 2(3).

Chen, J., R. Wang, L. Liu, and J. Song. (2011). “Clustering of Trajectories Based on Hausdorff Distance.”

2011 International Conference on Electronics, Communications and Control (ICECC), Ningbo,

China, 1940–44.

Cressie, N. (1993). Statistics for Spatial Data. New York: Wiley.

Cui, W., H. Zhou, H. Qu, P. C. Wong, and X. Li. (2008). “Geometry-Based Edge Clustering for Graph

Visualization.” IEEE Transactions on Visualization and Computer Graphics 14(6), 1277–84.

Diggle, P. (1983). Statistical Analysis of Spatial Point Patterns. London: Academic Press.

Fortin, M., and M. Dale. (2009). “Spatial Autocorrelation.” In The SAGE Handbook of Spatial Analysis, 89–103, edited by S. Fotheringham and P. Rogerson. London: Sage.

Fotheringham, S. (1997). “Trends in Quantitative Methods I: Stressing the Local.” Progress in Human

Geography 21(1), 88–96.

Fotheringham, S., and B. Zhan. (1996). “A Comparison of Three Exploratory Methods for Cluster Detec-

tion in Spatial Point Patterns.” Geographical Analysis 28(3), 200–18.

Geary, R. (1954). “The Contiguity Ratio and Statistical Mapping.” The Incorporated Statistician 5(3), 115–45.

Genolini, C., and B. Falissard. (2010). “KmL: K-Means for Longitudinal Data.” Computational Statistics

25(2), 317–28.

Getis, A., and J. Franklin. (1987). “Second-Order Neighborhood Analysis of Mapped Point Patterns.”

Ecology 68, 473–77.

Getis, A., and J. Ord. (1992). “The Analysis of Spatial Association by Use of Distance Statistics.” Geo-

graphical Analysis 24(3), 189–206.

Guo, D. (2009). “Flow Mapping and Multivariate Visualization of Large Spatial Interaction Data.” IEEE

Transactions on Visualization and Computer Graphics 15(6), 1041–48.

Kulldorff, M. (1997). “A Spatial Scan Statistic.” Communications in Statistics - Theory and Methods

26(6), 1481–96.

Lee, J. G., J. Han, and K. Y. Whang. (2007). “Trajectory Clustering: A Partition-and-Group Framework.” In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. Beijing, China, 593–604.

Liu, Y., D. Tong, and X. Liu. (2015). “Measuring Spatial Autocorrelation of Vectors.” Geographical Analysis 47(3), 300–319.

Lu, Y. (2006). “Spatial Choice of Auto Thefts in an Urban Environment.” Security Journal 19 (3),

143–166.

Lu, Y., and J.-C. Thill. (2003). “Assessing the Cluster Correspondence between Paired Point Locations.”

Geographical Analysis 35(4), 290–309.

Lu, Y., and J.-C. Thill. (2008). “Cross-scale Analysis of Cluster Correspondence Using Different Opera-

tional Neighborhoods.” Journal of Geographical Systems 10(3), 241–61.

McGoey, C. (2000). “Auto Theft Facts.” www.crimedoctor.com/autotheft1.htm


Moran, P. (1950). “Notes on Continuous Stochastic Phenomena.” Biometrika 37(1), 17–23.

Murray, A., Y. Liu, S. J. Rey, and L. Anselin (2011). “Exploring Movement Object Patterns.” The Annals

of Regional Science 49(2), 471–84.

Nanni, M., and Pedreschi, D. (2006). “Time-Focused Clustering of Trajectories of Moving Objects.”

Journal of Intelligent Information Systems 27(3), 267–289.

Okabe, A., B. Boots, and T. Satoh. (2010). “A Class of Local and Global K-functions and Their Exact Statistical Methods.” In Perspectives on Spatial Data Analysis, 101–12, edited by L. Anselin and S. J. Rey. Berlin, Heidelberg: Springer.

Openshaw, S., M. Charlton, C. Wymer, and A. Craft. (1987). “A Mark 1 Geographical Analysis Machine

for the Automated Analysis of Point Data Sets.” International Journal of Geographical Information

Systems 1(4), 335–58.

Ord, J., and A. Getis. (1995). “Local Spatial Autocorrelation Statistics: Distributional Issues and an

Application.” Geographical Analysis 27(4), 286–306.

Ossama, O., H. Mokhtar, and M. El-Sharkawi (2011). “Clustering Moving Objects Using Segments

Slopes.” International Journal of Database Management Systems 3(1), 35–48.

Ripley, B. D. (1976). “The Second-Order Analysis of Stationary Point Processes.” Journal of Applied

Probability 13, 255–66.

Symanzik, J. (2014). “Exploratory Spatial Data Analysis.” In Handbook of Regional Science, 1295–310, edited by M. M. Fischer and P. Nijkamp. Heidelberg, Germany: Springer.

Tobler, W. R. (1987). “Experiments in Migration Mapping by Computer.” The American Cartographer

14, 155–63.

Waller, L. (2009). “Detection of Clustering in Spatial Data.” In The SAGE Handbook of Spatial Analysis,

159–81, edited by S. Fotheringham and P. Rogerson. London: Sage.

Yamada, I., and J.-C. Thill. (2007). “Local Indicators of Network-Constrained Clusters in Spatial Point Patterns.” Geographical Analysis 39(3), 268–92.

Yamada, I., and J.-C. Thill. (2010). “Local Indicators of Network-Constrained Clusters in Spatial Patterns

Represented by a Link Attribute.” Annals of the Association of American Geographers 100(2),

269–85.

Zhu, X., and D. Guo. (2014). “Mapping Large Spatial Flow Data with Hierarchical Clustering.” Transac-

tions in GIS 18 (3), 421–35.

Geographical Analysis

Modelling and application of a spectral clustering method for shared bicycle trajectories

Article

Full-text available

Jan 2024

Geographic flow clustering analysis can effectively reveal human behavioral patterns in movement. Traditional methods for studying human movement patterns are mostly based on first-order quantity analyses of point data, such as hotspots, density or clustering. Currently, relatively few second-order spatial analysis methods based on geographic flows exist. Thus, we developed a new geographic flow method based on spectral clustering and applied it to trajectory data analysis. This article uses the bike-sharing trajectories data in Shanghai in August 2016, spectral clustering analysis was conducted on the group flow patterns before, during and after rainfall, on weekdays and weekends and in the morning and evening peak. Spectral clustering was verified to exhibit better clustering effect by comparing the clustering indices of different clustering methods. This study enriches the analysis method of geographical flows, and the human mobility patterns revealed by its analysis can provide references for formulating urban green travel policies.

Deciphering flow clusters from large-scale free-floating bike sharing journey data: a two-stage flow clustering method

Article

Full-text available

Aug 2023
TRANSPORTATION

Extracting flow clusters consisting of many similar origin–destination (OD) trips is essential to uncover the spatio-temporal interactions and mobility patterns in the free-floating bike sharing (FFBS) system. However, due to occlusion and display clutter issues, efforts to identify inhomogeneous flow clusters from large journey data have been hampered to some extent. In this study, we present a two-stage flow clustering method, which integrates the Leiden community detection algorithm and the shared nearest-neighbor-based flow (SNN_flow) clustering method to efficiently identify flow clusters with arbitrary shapes and uneven densities. The applicability and performance of the method in detecting flow clusters are investigated empirically using the FFBS system of Nanjing, China as a case study. Some interesting findings can be drawn from the spatio-temporal patterns. For instance, the share of flow clusters used to meet the “first-/last-mile” demand at metro stations is reasonably high, both during the morning (71.85%) and evening (65.79%) peaks. Compared with the “first-/last-mile” flow clusters between metro stations and adjacent workplaces, the solution of the “first-/last-mile” flow clusters between metro stations and adjacent residences is more dependent on the FFBS system. In addition, we explored the shape and density distribution of flow clusters from the perspective of origin and destination points. The endpoint distribution characteristics demonstrate that the shape distribution of metro station point clusters is generally flatter and the spatial points within them are more concentrated than other sorts of point clusters. Our findings could help to better understand human movement patterns and home-work commute, thereby providing more rational and targeted decisions for allocating FFBS infrastructure resources.

Flow Spatiotemporal Moran's I : Measuring the Spatiotemporal Autocorrelation of Flow Data

Article

Mar 2024

Flows can reflect the spatiotemporal interactions or movements of geographical objects between different locations. Measuring the spatiotemporal autocorrelation of flows can help determine the overall spatiotemporal trends and local patterns. However, quantitative indicators of flows used to measure spatiotemporal autocorrelation both globally and locally are still rare. Therefore, we propose the global and local flow spatiotemporal Moran's I (FSTI). The global FSTI is used to assess the overall spatiotemporal autocorrelation degree of flows, and the local FSTI is applied to identify local spatiotemporal clusters and outliers. In the FSTI, to reflect flow spatiotemporal adjacency relationships, we establish flow spatiotemporal weights by multiplying the spatial and temporal weights of flows considering spatiotemporal orthogonality. The flow spatial weights include contiguity‐based (considering first/higher‐order and common border) and Euclidean distance‐based weights. The temporal weights consider ordinary and lagged cases. As flow attributes may follow a long‐tail distribution, we conduct Monte Carlo simulations to evaluate the statistical significance of the results. We assess the FSTI using synthetic datasets and Chinese population mobility datasets, and compare some results with those of recent flow‐related methods. Additionally, we perform a sensitivity analysis to select a suitable temporal threshold. The results show that the FSTI can be used to effectively detect spatiotemporal variations in the autocorrelation degree and type.
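The weight construction described in this abstract, a spatial weight multiplied by a temporal weight, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the binary distance-based rule (two flows are spatial neighbours when both their origins and their destinations lie within `d_max`), the binary temporal rule (timestamps within `t_max`), and all function names.

```python
import math

def spatial_weight(f, g, d_max):
    """Assumed rule: 1.0 if origins AND destinations are both within d_max."""
    (f_o, f_d), (g_o, g_d) = f, g
    close = math.dist(f_o, g_o) <= d_max and math.dist(f_d, g_d) <= d_max
    return 1.0 if close else 0.0

def temporal_weight(t_f, t_g, t_max):
    """Assumed rule: 1.0 if the two timestamps differ by at most t_max."""
    return 1.0 if abs(t_f - t_g) <= t_max else 0.0

def spatiotemporal_weights(flows, times, d_max, t_max):
    """Pairwise weight matrix: product of spatial and temporal weights."""
    n = len(flows)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # a flow is not its own neighbour
                w[i][j] = (spatial_weight(flows[i], flows[j], d_max)
                           * temporal_weight(times[i], times[j], t_max))
    return w

# Hypothetical flows given as ((origin_x, origin_y), (dest_x, dest_y)).
flows = [((0, 0), (10, 10)), ((1, 0), (10, 11)), ((0, 0), (10, 10))]
times = [0, 1, 10]
W = spatiotemporal_weights(flows, times, d_max=2.0, t_max=2)
```

With these inputs, flows 0 and 1 are neighbours in both space and time, while flows 0 and 2 coincide in space but are too far apart in time, so their product weight is zero.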

A space-time flow LISA approach for panel flow data

Article

Sep 2023
COMPUT ENVIRON URBAN

Spatial flow data represent meaningful spatial interaction (SI) phenomena between geographic regions that are often highly dynamic. However, most existing flow analytical methods are cross-sectional, and there is a lack of methods to measure spatiotemporal autocorrelation of flow data. To fill this gap, we proposed a new localized spatial statistical method called Space-Time Flow LISA. The method design is a combination of two existing method families, namely space-time LISA and Spatial Flow LISA. A critical component of the method is the space-time weight matrix of flow data that blends pairwise spatial and temporal connectivities. We design three versions of the matrix, namely contemporaneous, lagged, and hybrid. We evaluate the method using both synthetic data and a case study of U.S. interstate migration from 2005 to 2017. The method is found to have high efficacy in finding spatiotemporal local autocorrelation patterns. Unlike the Spatial Flow LISA, which tends to detect short-distance migration corridor havens (‘HH’ flows) and long-distance migration corridor deserts (‘LL’ flows), the Space-Time Flow LISA is less impeded by the distance between flow origin and destination, as it can pick up local patterns that are less spatially explicit but temporally dependent. In addition, the new method is able to detect time-sensitive patterns such as the outmigration from Louisiana forced by Hurricane Katrina in 2005. By integrating spatial, temporal, and attributive associations into a one-step analysis, the proposed Space-Time Flow LISA can illustrate the spatiotemporal structure of flow phenomena well, and reveal dynamic distribution changes over time.

Strength-weighted flow cluster method considering spatiotemporal contiguity to reveal interregional association patterns

Article

Sep 2023

One of the most crucial topics in spatial interaction studies is mining patterns from extensive origin-destination (OD) flow data to capture interregional associations. However, prevailing methodologies tend to disregard the importance of using the relative closeness of interregional connections as weights, treat spatial and temporal dimensions independently, or overlook the temporal dimension completely. Consequently, the identified patterns are susceptible to inaccuracies, and the precise identification of pattern occurrence time and duration, despite their fundamental importance, remains elusive. In light of these challenges, this study proposes a strategy to calculate and combine the strength of weighted spatiotemporal flows, and develops a clustering method and evaluation metrics based on this framework. Compared to alternative density-based methods, the strength-based calculation approach demonstrates a capacity to identify flow patterns characterized by relatively high interregional closeness. Thus, the identification of flow patterns expands beyond density-based approaches, encompassing strength-based considerations and a shift from absolute to relative closeness between regions. Experiments using synthetic datasets conducted in this research demonstrate the effectiveness, efficiency, and extraction accuracy of the proposed method. Furthermore, a case study using real Chinese population migration data demonstrates the efficacy of the method in revealing implicit spatiotemporal association patterns between regions. The present study implements an interaction strength-based flow clustering and evaluation method that considers spatiotemporal continuity, making it applicable to spatial flow data analysis involving interaction volume and time attributes. As a result, this method holds promise for facilitating the modeling of intricate spatial flows within various contexts of study.

Rethinking the null hypothesis in significant colocation pattern mining of spatial flows

Article

May 2024
J GEOGR SYST

Spatial flows represent spatial interactions or movements. Mining colocation patterns of different types of flows may uncover the spatial dependences and associations among flows. Previous studies proposed a flow colocation pattern mining method and established a significance test under the null hypothesis of independence for the results. In fact, the definition of the null hypothesis is crucial in significance testing. Choosing an inappropriate null hypothesis may lead to misunderstandings about the spatial interactions between flows. In practice, the overall distribution patterns of different types of flows may be clustered. In these cases, the null hypothesis of independence will result in unconvincing results. Thus, considering the overall spatial pattern of flows, in this study, we changed the null hypothesis to random labeling to establish the statistical significance of flow colocation patterns. Furthermore, we compared and analyzed the impacts of different null hypotheses on flow colocation pattern mining through synthetic data tests with different preset patterns and situations. Additionally, we used empirical data from ride-hailing trips to show the practicality of the method.

Beekeeping Behavior of Chinese Beekeepers Shows Spatial Contraction

Article

Mar 2024

Apiculture is an important industry closely related to the national economy and people’s livelihoods. Beekeepers’ behavior is an important factor affecting the yield, quality, and benefits of apiculture. However, there is a lack of a systematic understanding of the long-term changes in beekeeping decisions made by beekeepers. Using panel data, we analyzed the dynamic trends and related influencing factors of decisions made by beekeeping models, honey source plant selection, and the migration flow space of beekeepers from 2009 to 2020. The results showed that the proportion of the LMB model decreased, while the PAB and SMB models continued to increase, the frequency of utilization of the main nectar source plants for honey collection decreased, and the concentration of migratory flow of beekeeping increased. Behavior of beekeepers from 2009 to 2020 showed a certain degree of spatial contraction, which seriously restricted the effective use of nectar plant resources. Family attributes, economic status, beekeeping models, and disaster conditions directly or indirectly affected beekeepers’ decisions. We propose a series of recommendations to facilitate the transformation and advancement of the Chinese bee industry. This study promotes an understanding of sustainable development of the bee industry in China and other countries worldwide.

Understanding Spatial Dependency Among Spatial Interactions

Chapter

Apr 2024

Length-squared L-function for identifying clustering pattern of network-constrained flows

Article

Oct 2023

The network-constrained flow is defined as the movement between two locations along road networks, such as the residence-workplace flow of city dwellers. Among flow patterns, clustering (i.e. the origins and destinations are aggregated simultaneously) is one of the cities’ most common and vital patterns, which assists in uncovering fundamental mobility trends. The existing methods for detecting the clustering pattern of network-constrained flows do not consider the impact of road network topology complexity on detection results. They may underestimate the flow clustering between networks of simple topology (roads with simpler shapes and fewer links, e.g. straight roads) but with high network intensity (i.e. flow number per network flow space), and determining the actual scale of an observed pattern remains challenging. This study develops a novel method, the length-squared L-function, to identify clustering patterns of network-constrained flows. We first use the L-function and its derivative to examine the clustering scales. Further, we calculate the local L-function to ascertain the locations of the clustering patterns. In synthetic and practical experiments, our method can identify flow clustering patterns of high intensities and reveal the residents’ main travel behavior trends with taxi OD flows, providing more reasonable suggestions for urban planning.

A kriging interpolation model for geographical flows

Article

Aug 2023

Mapping Large Spatial Flow Data with Hierarchical Clustering

Article

Jun 2014

It is challenging to map large spatial flow data due to the problem of occlusion and cluttered display, where hundreds of thousands of flows overlap and intersect each other. Existing flow mapping approaches often aggregate flows using predetermined high-level geographic units (e.g. states) or bundle partial flow lines that are close in space, both of which cause a significant loss or distortion of information and may miss major patterns. In this research, we developed a flow clustering method that extracts clusters of similar flows to avoid the cluttering problem and reveal abstracted flow patterns, while preserving data resolution as much as possible. Specifically, our method extends the traditional hierarchical clustering method to aggregate and map large flow data. The new method considers both origins and destinations in determining the similarity of two flows, which ensures that a flow cluster represents flows from similar origins to similar destinations and thus minimizes information loss during aggregation. With the spatial index and search algorithm, the new method is scalable to large flow data sets. As a hierarchical method, it generalizes flows to different hierarchical levels and has the potential to support multi-resolution flow mapping. Different distance definitions can be incorporated to adapt to uneven spatial distribution of flows and detect flow clusters of different densities. To assess the quality and fidelity of flow clusters and flow maps, we carry out a case study to analyze a data set of 243,850 taxi trips within an urban area.

A Spatial Scan Statistic

Article

Jun 1997

Martin Kulldorff

The scan statistic is commonly used to test if a one dimensional point process is purely random, or if any clusters can be detected. Here it is simultaneously extended in three directions: (i) a spatial scan statistic for the detection of clusters in a multi-dimensional point process is proposed, (ii) the area of the scanning window is allowed to vary, and (iii) the baseline process may be any inhomogeneous Poisson process or Bernoulli process with intensity proportional to some known function. The main interest is in detecting clusters not explained by the baseline process. These methods are illustrated on an epidemiological data set, but there are other potential areas of application as well.

Local indicator of spatial association-LISA

Article

Jan 1995

Luc Anselin

The analysis of spatial association by use of distance statistics

Article

Jan 1992

Introduced in this paper is a family of statistics, G, that can be used as a measure of spatial association in a number of circumstances. The basic statistic is derived, its properties are identified, and its advantages explained. Several of the G statistics make it possible to evaluate the spatial association of a variable within a specified distance of a single point. A comparison is made between a general G statistic and Moran’s I for similar hypothetical and empirical conditions. The empirical work includes studies of sudden infant death syndrome by county in North Carolina and dwelling unit prices in metropolitan San Diego by zip-code districts. Results indicate that G statistics should be used in conjunction with I in order to identify characteristics of patterns not revealed by the I statistic alone and, specifically, the Gi and Gi* statistics enable us to detect local “pockets” of dependence that may not show up when using global statistics.

The second-order analysis of stationary point processes

Article

Jun 1976

B. D. Ripley

This paper provides a rigorous foundation for the second-order analysis of stationary point processes on general spaces. It illuminates the results of Bartlett on spatial point processes, and covers the point processes of stochastic geometry, including the line and hyperplane processes of Davidson and Krickeberg. The main tool is the decomposition of moment measures pioneered by Krickeberg and Vere-Jones. Finally some practical aspects of the analysis of point processes are discussed.

Assessing the cluster correspondence between paired point locations

Article

Oct 2003

Some complex geographic events are associated with multiple point locations. Such events include, but are not limited to, those describing linkages between and among places. The term multi-location event is used in the paper to refer to these geographical phenomena. Through formalization of the multi-location event problem, this paper situates the analysis of multi-location events within the broad context of point pattern analysis techniques. Two alternative approaches (vector autocorrelation analysis and cluster correspondence analysis) to the spatial dependence of paired-location events (i.e., two-location events) are explored, with a discussion of their appropriateness to general multi-location event problems. The research proposes a framework of cluster correspondence analysis for the detection of local non-stationarities in the spatial process generating multi-location events. A new algorithm for local analysis of cluster correspondence is proposed. It is implemented on a large-scale dataset of vehicle theft and recovery location pairs in Buffalo, New York.

A mark I geographical analysis machine for the automated analysis of point data sets

Article

Jan 1987
Int J Geogr Inform Syst

Exploratory Spatial Data Analysis

Chapter

Jan 2014

Jürgen Symanzik

In this chapter, we discuss key concepts for exploratory spatial data analysis (ESDA). We start with its close relationship to exploratory data analysis (EDA) and introduce different types of spatial data. Then, we discuss how to explore spatial data via different types of maps and via linking and brushing. A key technique for ESDA is local indicators of spatial association (LISA). ESDA needs to be supported by software. We discuss two main lines of software developments: GIS-based solutions and stand-alone solutions.

Measuring Spatial Autocorrelation of Vectors

Article

Dec 2014
GEOGR ANAL

This article introduces measures to quantify spatial autocorrelation for vectors. In contrast to scalar variables, spatial autocorrelation for vectors involves an assessment of both direction and magnitude in space. Extending conventional approaches, measures of global and local spatial associations for vectors are proposed, and the associated statistical properties and significance testing are discussed. The new measures are applied to study the spatial association of taxi movements in the city of Shanghai. Complications due to the edge effect are also examined.

Clustering of trajectories based on Hausdorff distance

Conference Paper

Sep 2011

Spatio-temporal and geo-referenced datasets are growing rapidly with the development of technologies such as GPS and satellite systems, and trajectory clustering has attracted considerable interest. Existing trajectory clustering algorithms group similar trajectories as a whole and cannot distinguish the direction of a trajectory. Our key observation is that clustering trajectories as a whole can miss common sub-trajectories, and that trajectories carry direction information. In many applications, discovering common sub-trajectories is very useful. In this paper, we present a trajectory clustering algorithm, CTHD (clustering of trajectories based on Hausdorff distance). In CTHD, each trajectory is first described by a sequence of flow vectors and partitioned into a set of sub-trajectories. Next, the similarity between trajectories is measured by their respective Hausdorff distances. Finally, the trajectories are clustered by the DBSCAN clustering algorithm. The proposed algorithm differs from other schemes using Hausdorff distance in that the flow vectors include both position and direction, so it can distinguish trajectories in different directions. The experimental results demonstrate this behavior.
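The Hausdorff distance this abstract relies on can be sketched as follows. This is the standard symmetric Hausdorff distance between two point sets, computed on positions only. It is not the CTHD variant, which per the abstract also encodes direction in the flow vectors. The function names and sample coordinates are hypothetical.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Two hypothetical trajectories, each a list of (x, y) positions.
traj_a = [(0, 0), (1, 0)]
traj_b = [(0, 1), (1, 1), (5, 1)]

d_ab = directed_hausdorff(traj_a, traj_b)  # every point of a is 1 unit from b
d_sym = hausdorff(traj_a, traj_b)          # dominated by the outlying point (5, 1)
```

A pairwise matrix of such distances could then be passed to a density-based clusterer that accepts precomputed distances. That step is omitted here.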
