
Spatial Cluster Detection in Spatial Flow Data


As a typical form of geographical phenomena, spatial flow events have been widely studied in contexts like migration, daily commuting, and information exchange through telecommunication. Studying the spatial pattern of flow data serves to reveal essential information about the underlying process generating the phenomena. Most methods of global clustering pattern detection and local clusters detection analysis are focused on single-location spatial events or fail to preserve the integrity of spatial flow events. In this research we introduce a new spatial statistical approach of detecting clustering (clusters) of flow data that extends the classical local K-function, while maintaining the integrity of flow data. Through the appropriate measurement of spatial proximity relationships between entire flows, the new method successfully upgrades the classical hot spot detection method to the stage of “hot flow” detection. Several specific aspects of the method are discussed to provide evidence of its robustness and expandability, such as the multiscale issue and relative importance control, using a real data set of vehicle theft and recovery location pairs in Charlotte, NC.
Ran Tao, Jean-Claude Thill
Department of Geography and Earth Sciences and Project Mosaic, University of North Carolina at
Charlotte, Charlotte, NC
Introduction
Spatial flows, also known as interactions between georeferenced places, constitute an enduring
object of research in spatial sciences. A flow event in geography typically consists of two basic
components, namely the spatial one, represented as a vector, and the aspatial component, which
encapsulates the type or value it represents. Common examples include migration flows, daily com-
muting flows, international trade flows, and flows of information exchanged through telecommuni-
cation. In general, there are two types of flow data, namely individual flows and aggregated flows
(Murray et al. 2011). The former pertain to individual activities, for example one person taking the
subway from home to work on a weekday morning. In contrast, the latter represent the movement
or interactions of a group of people or objects, for example a group of elks residing in the northern
section of Yellowstone National Park and migrating to lower altitudes before winter arrives.
Correspondence: Ran Tao, Department of Geography and Earth Sciences and Project Mosaic, Univer-
sity of North Carolina at Charlotte, Charlotte, NC
e-mail: rtao2@uncc.edu
[Correction added on 1 June 2016, after first online publication: the publisher apologizes for the wrong
version of this article being inadvertently published due to a technical error. Corrections for clarity
have been made throughout the article in the text, equations and references, without impacting the
results or conclusions of the study].
Submitted: March 05, 2015. Revised version accepted: February 01, 2016.
doi: 10.1111/gean.12100
© 2016 The Ohio State University
Geographical Analysis (2016) 00, 00–00
Understanding the pattern and dynamics of spatial flows has been a long-standing goal of
spatial scientists. With the fast development in sensor and GPS technologies in recent years, large
volumes of spatiotemporal data have become available with fine granularity. In addition, emerg-
ing types of interactive activities, like information exchange on social media networking, enhance
the richness of flow events. The increased availability of massive volumes of new forms of flow
data inevitably brings unprecedented opportunities to enrich our understanding of patterns and
processes embedded in the geographic space, but this also presents new analytical challenges at
several levels. First, there is the challenge to develop advanced methods to generalize and extract
useful information from massive flow data; next, the challenge to conceive new visualization
approaches to represent flows more effectively; also, to design handy and highly interactive tools
to incorporate flow data into geospatial information systems; and finally, to build spatial interac-
tion models to understand the nature behind locational choices and their relationships. Among
these endeavors, detecting spatial distribution patterns globally or locally, that is, clustered, scat-
tered, or random, across the spatial extent has garnered a lot of attention. While many contribu-
tions have used techniques such as Spatial Data Mining, Geovisualization, and Graph Theory
(Tobler 1987; Cui et al. 2008; Guo 2009; Zhu and Guo 2014) to better handle large data volumes,
we contend that spatial statistics has not shown its full potential for the detection of spatial
distribution patterns of flow data, in spite of the abundance of effective spatial statistical
techniques that have been devised to deal with spatial point data, spatial line segment data, and
spatial polygon data (e.g., Moran’s I [Moran 1950], Geary’s C [Geary 1954], Getis and Ord’s G [Getis
and Ord 1992; Ord and Getis 1995], and Ripley’s K-function [Ripley 1976]). Thus, it is the purpose
of this study to develop novel spatial statistical approaches to detect spatial clustering patterns in
flow data with the aim of understanding their spatial relationships, while preserving the integrity
of the flow data. To this end, we introduce new spatial proximity measures tailored for flow data,
on the basis of which we extend the well-known point data analysis method, namely the local
Ripley’s K-function, to the spatial flow context. The new approach is presented and the evidence
of its robustness and efficiency is provided via experiments on a real data set.
The rest of this article is organized as follows. In the second section a brief literature
review is provided, which covers previous studies on spatial clustering detection especially
those pertaining to flow data. Then a thorough explanation of our new approach is presented,
including both the theoretical foundations and the technical details. The fourth section consists
of experiments with real data, along with evaluations of the performance of the proposed ana-
lytical method. We conclude with a discussion of the main characteristics and contribution of
our method, as well as proposed future extensions.
Literature review
Given the general tendency of spatial phenomena to co-occur spatially as encapsulated by
Tobler’s First Law of Geography (1970), spatial clustering is one of the most common spatial
patterns of point events. It represents a general tendency of events occurring closer to each
other than one might expect by chance (Waller 2009). An extensive body of literature on cluster
detection and monitoring has advanced various methods to identify such patterns.
Several excellent references provide overviews of the concepts and methods involved (e.g.,
Diggle 1983; Cressie 1993; Fortin and Dale 2009; Symanzik 2014).
Early studies were mostly concerned with the overall spatial pattern exhibited by the
events and devised spatial statistics as a single index, sometimes labeled as “global” statistics,
to depict the nature of events and of the spatial process producing a certain spatial distribution
within the entire study area. Well-known examples include Moran’s I, Geary’s C, Quadrat
Analysis, Nearest Neighbor Index, and Ripley’s K-function. However, one of the fundamental
assumptions of these methods, namely spatial stationarity, is difficult to satisfy in
many real situations. Furthermore, a single statistic does not allow further investigation of more
detailed patterns and relationships, such as how the spatial process associated with one variable
would be dependent on others (Fotheringham 1997). To cope with such issues, spatial pattern
analysis has shifted toward the development of local statistics for detecting spatial clusters. In
contrast with global spatial clustering methods that are designed to identify whether there exists
a general tendency for events to occur nearer other events than expected by chance, techniques
for localized cluster detection are aimed at finding anomalies and interesting collections of spa-
tial events within the study area that appear to be inconsistent with the background conceptual
model of how events arise (Besag and Newell 1991; Waller 2009). Notable approaches include
the geographical analysis machine (GAM) (Openshaw et al. 1987) and its derivative methods
(Besag and Newell 1991; Fotheringham and Zhan 1996), the local version of Ripley’s K-
function (Getis and Franklin 1987), local indicators of spatial association (LISA) especially the
local Moran’s I statistic, local Geary’s C (Anselin 1995), and local G statistic (Getis and Ord
1992; Ord and Getis 1995). Some local detection methods around predetermined locations are
called “focused tests” to differentiate them from those based on randomly chosen event loca-
tions (Besag and Newell 1991). The local Cross K-function is such a focused test to identify
clusters of events around specific locations, such as crime instances around railway stations or
shopping malls (Boots and Okabe 2007). Regardless of the technical details, local cluster
detection methods all hold the advantages that they can better integrate with the fast-
developing GeoComputation technology to handle large data sets and their results can be well
illustrated with the visualization and mapping capabilities of Geographic Information Systems
(GIS) (Fotheringham 1997). Recent contributions come from both the methodological develop-
ment perspective, such as the network-constrained local K-function and local Moran’s I
(Yamada and Thill 2007, 2010) and A Multidirectional Optimum Ecotope-Based Algorithm
(AMOEBA) (Aldstadt and Getis 2006), and from the toolset design perspective, for example
R, ArcGIS, GeoDa (Anselin, Syabri, and Kho 2006), and SaTScan (Kulldorff et al. 1997).
The preponderance of the literature on spatial point pattern analysis treats each point as an
event independent of all the others. Spatial flow data, however, encompass at least two points,
one corresponding to the origin or start of the flow and one for the destination or end of the flow.
Flow data, therefore, differ fundamentally from single point data and methods designed to handle
the latter cannot be directly applied to flow data. Several endeavors have been undertaken in
previous research to fill this gap. Berglund and Karlström (1999) applied the $G_i$ statistics introduced
by Getis and Ord (1992) and Ord and Getis (1995) to identify local spatial association in flow
data. Although several different spatial weight matrices were proposed in this article to address
spatial non-stationarity, only the simplest binary spatial weight matrix based on identical origins
or destinations was implemented, which certainly limits its usage. Lu and Thill (2003) proposed
an ad hoc and partially qualitative approach in which they apply point cluster detection methods
to analyze origin and destination points respectively, and combine the two sets of results via a
relationship table to conclude on the patterns exhibited by the flows. Related issues such as sensi-
tivity to scale and neighborhood definition were discussed in their later work (Lu and Thill 2008).
While decomposing one-dimensional flows into zero-dimensional points can considerably sim-
plify the problem, this approach would inevitably overlook the simultaneity of some critical
information, such as flow direction and flow length. Murray et al. (2011) departed from this
approach by combining exploratory spatial data analysis and confirmatory circular statistics to
analyze the similarities of flow direction and length. However, they sacrifice the actual locational
information in the process so that little knowledge on spatial relationships between movements
can be extracted. More recently, Liu, Tong, and Liu (2015) extended both global and local Mor-
an’s I statistics to a flow context, considering movement distances and directions at once. None-
theless, their approach is still based on the spatial proximity relationship of either set of end
points rather than entire vectors. Therefore, we contend that it remains within the scope of meas-
uring spatial autocorrelation of vectors/flows in parts rather than as a whole. The method pro-
posed in this article departs radically from the existing literature by maintaining the integrity of
flow data. It not only fully considers flow characteristics, that is, end points, length, and direction,
but also builds on proper measurement of spatial proximity relationship between entire flows.
While this article mainly focuses on spatial statistical methods, contributions from other
perspectives are also worth considering. Various research contributions apply techniques of
data mining and geovisualization to investigate the properties of spatial flows. Tobler (1987)
suggested that selective information aggregation and removal is an effective strategy for identi-
fying patterns through visualization and he pioneered this idea to analyze migration flows with
computer-drawn maps. Benefiting from burgeoning computing capability and visualization per-
formance, many contributions have emerged to be both effective and efficient, especially for
large data sets. K-means algorithms have proved very effective with respect to multilocation
spatial data (Genolini and Falissard 2010; Ossama, Mokhtar, and El-Sharkawi 2011). Density-
based clustering methods have also been adjusted to the nature of flow data by summarizing
the distributions of origins and destinations (Nanni and Pedreschi 2006; Zhu and Guo 2014).
Geometry-based edge-bounding is another type of approach that reduces the visual clutter
caused by extensive edge crossing in flow maps (Cui et al. 2008). To serve the same purpose,
Guo (2009) proposed a visualization framework to partition spatial interactions into their
“natural” regions and discover mixing patterns of flow networks. In general, such visual analytical
methods embrace the principle of data mining and analytical classification methods
designed to group observations into “clusters” based on similarity (Waller 2009); therefore,
they are also named “cluster analysis.” Because this overlap in terminology can be confusing, it
is necessary to differentiate these “cluster analysis” methods from the spatial statistical
approaches to cluster detection presented in this article. While we mainly focus on building
innovative spatial statistics here, incorporating these exploratory methods as a preliminary
step could help in formulating hypotheses.
Methodology
The principle
In spatial analysis, cluster detection is an approach to second-order analysis that is designed to
examine spatial dependence, or spatial relationships between events (Getis and Franklin 1987).
The first step is to choose an appropriate measure of spatial proximity between events, for
which distance is a common choice. Ripley’s K-function, Geographic Analysis Machine, Near-
est Neighbor Index and many other statistical approaches are all distance-based methods. Aside
from the default Euclidean distance, other kinds of distance are also applied in some cases, for
instance the network distance (Yamada and Thill 2007). With spatial flow data, there is no natural
means to measure spatial proximity due to the multilocation nature of flow records, and this
is arguably the biggest difficulty in analyzing spatial patterns of flow data. In other words, with
appropriately measured spatial proximity, cluster detection on flows boils down to the same
algorithmic processes as for points or polygons. Although various distance measures have been
proposed in data mining studies of trajectory, for example using the Hausdorff distance to
extract clustered line segments of trajectories (Lee, Han, and Whang 2007; Chen et al. 2011),
we argue that these distances are not suitable to measure proximity between flows which have
explicit and meaningful location correspondence. Accordingly, we devise a new proximity
measure called the “Flow Distance” and a variant called the “Flow Dissimilarity.” Then we
extend a well-developed spatial point statistic, namely Ripley’s K-function, to the spatial flow
context based on the newly defined proximity measures. Statistical significance is tested by
Monte Carlo simulation against the null hypothesis of spatial randomness. Several aspects such
as the multiscalar relevance, relative importance control, and flow value, are discussed in detail
here to demonstrate that this method is versatile and practical.
Flow model
The first step is to define the study object, namely the spatial flow process. Fig. 1 shows two
instances of a spatial process $F$ that starts at location $O$ and ends at location $D$. Basic characteristics
of $F$ include length: $l = |\vec{OD}|$; direction: same as the direction of vector $\vec{OD}$; type: $T$ (e.g.,
commuting flow); and value: $W$ (e.g., the number of commuters). This basic model is used to
represent spatial flow processes in the rest of the article.
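As an illustration only, the flow model can be held in a small record type. The Python sketch below is not part of the original study: the class name `Flow` and its field names are hypothetical, and expressing the direction as an angle in radians is an assumption made for convenience.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Flow:
    """One spatial flow F from origin O(x, y) to destination D(u, v)."""
    x: float                        # origin x-coordinate
    y: float                        # origin y-coordinate
    u: float                        # destination x-coordinate
    v: float                        # destination y-coordinate
    flow_type: str = "unspecified"  # T, e.g. "commuting" (hypothetical label)
    value: float = 1.0              # W, e.g. number of commuters

    @property
    def length(self) -> float:
        """Length l = |OD| of the flow vector."""
        return math.hypot(self.u - self.x, self.v - self.y)

    @property
    def direction(self) -> float:
        """Direction of vector OD in radians, counterclockwise from the x-axis."""
        return math.atan2(self.v - self.y, self.u - self.x)


# Hypothetical flow from (0, 0) to (3, 4)
example = Flow(0.0, 0.0, 3.0, 4.0, flow_type="commuting", value=12)
```

Keeping the four coordinates together in one record mirrors the article's emphasis on treating a flow as a single object rather than as two unrelated points.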
Flow proximity
As mentioned earlier, defining an appropriate proximity measure is the key to decode spatial
flow patterns. Here we introduce such measures based on which both intrarelationships and
interrelationships of flows can be extracted.
Let us take the simple case of measuring the spatial proximity between flow $F_i$ (with origin
point $O_i(x_i, y_i)$ and destination point $D_i(u_i, v_i)$) and flow $F_j$ (from point $O_j(x_j, y_j)$ to point
$D_j(u_j, v_j)$) in a two-dimensional space (Fig. 1). Measuring distance between these two spatial flows
following the approaches advocated so far in the literature would generally be inadequate
because distance between either origin points or destination points cannot fully represent the
closeness between flows in their entirety. For instance, when both origins are a short (or long)
distance to each other and the same can be said of destinations, we can expect that $F_i$ and $F_j$
are also close (or distant, respectively). However, things become less trivial when the two endpoint
pairs show dissimilar spatial closeness, that is, origins are close while destinations are
distant, or vice versa. Using categorical descriptions is certainly one way to associate distances
among origins and destinations. For instance, both distances being short (or both endpoint pairs
belong to the same region) would correspond to “high” spatial association between flows while
only one pair of end points being close (or belonging to the same region) would correspond to
Figure 1. Basic flow model.
a “medium” degree of association (Berglund and Karlström 1999; Lu and Thill 2003; Zhu and
Guo 2014). While such approaches make sense to some extent, they are very sensitive to the ad
hoc description standards and exhibit limited external validity.
Unlike approaches treating spatial flows as two separate sets of endpoints, we propose to
calculate a flow distance that regards flows as inseparable objects. A flow process $F_i$ with origin
point $O_i(x_i, y_i)$ and destination point $D_i(u_i, v_i)$ can be seen as a vector point with four
coordinates $F_i(x_i, y_i, u_i, v_i)$ in a four-dimensional space. Derived from the general function of
Euclidean distance, we define the Flow Distance between flows $F_i(x_i, y_i, u_i, v_i)$ and
$F_j(x_j, y_j, u_j, v_j)$ as:
$$FD_{ij} = \sqrt{\alpha\left[(x_i - x_j)^2 + (y_i - y_j)^2\right] + \beta\left[(u_i - u_j)^2 + (v_i - v_j)^2\right]},$$
or, in simplified form,
$$FD_{ij} = \sqrt{\alpha d_O^2 + \beta d_D^2}. \qquad (1)$$
where $FD_{ij}$ denotes the distance between these two flows; $d_O$ and $d_D$ are the Euclidean distances
between the two origins and two destinations, respectively; the coefficients $\alpha$ and $\beta$ serve
to control the relative importance of either set of endpoints ($\alpha > 0$; $\beta > 0$; $\alpha + \beta = 2$; by
default $\alpha = \beta = 1$). Through this definition, both the closeness of origins and of destinations
make a contribution to the calculation of the Flow Distance. For example in Fig. 2a,
$FD_{12} = \sqrt{2^2 + 2^2} = \sqrt{8}$. The value of Flow Distance becomes larger (or smaller) if both
endpoints are moved further from (or closer to) their counterparts at the same time, for example, $FD_{12}$
increases to $\sqrt{18}$ in Fig. 2b while it decreases to $\sqrt{2}$ in Fig. 2c. This corresponds to the general
sense that proximities of endpoints are positively correlated to the flow closeness.
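The arithmetic of equation (1) can be checked with a few lines of Python. This is a minimal sketch rather than the authors' implementation: the function name `flow_distance`, the tuple layout `(x, y, u, v)`, and the coordinates are all assumptions, chosen only so that the endpoint separations match those quoted for Fig. 2a.

```python
import math


def flow_distance(fi, fj, alpha=1.0, beta=1.0):
    """Flow Distance of equation (1) between two flows given as (x, y, u, v)."""
    if alpha <= 0 or beta <= 0 or not math.isclose(alpha + beta, 2.0):
        raise ValueError("need alpha > 0, beta > 0 and alpha + beta = 2")
    xi, yi, ui, vi = fi
    xj, yj, uj, vj = fj
    d_o_sq = (xi - xj) ** 2 + (yi - yj) ** 2   # squared distance between origins
    d_d_sq = (ui - uj) ** 2 + (vi - vj) ** 2   # squared distance between destinations
    return math.sqrt(alpha * d_o_sq + beta * d_d_sq)


# Hypothetical coordinates: origins 2 units apart, destinations 2 units apart.
f1 = (0.0, 0.0, 0.0, 10.0)
f2 = (2.0, 0.0, 2.0, 10.0)
fd_12 = flow_distance(f1, f2)   # sqrt(2**2 + 2**2) = sqrt(8)
```

Moving both endpoints of the second flow to 3 units (or 1 unit) from their counterparts gives $\sqrt{18}$ (or $\sqrt{2}$), matching the values quoted for Fig. 2b and Fig. 2c.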
More importantly, the distance between origins and the distance between destinations are
integrated by the same square root transformation so their variations are captured continuously
and consistently, which leads to greater accuracy than qualitative descriptors. For instance,
compared with Fig. 2a, Flow $F_2$ in Fig. 2d has its origin moved toward $F_1$’s and has its destination
moved away from $F_1$’s. According to previous methods, whether these two flows in
Figure 2. Flow Distance Examples.
Fig. 2d are as close as they are in Fig. 2a depends entirely on the definition of the endpoints’
contiguity relationship. For example, if two points are defined as contiguous when their distance
is less than or equal to 2, $F_1$ and $F_2$ would have two contiguous endpoint pairs in Fig. 2a
but only one in Fig. 2d. As a result, the proximities between $F_1$ and $F_2$ are radically different.
In contrast, by our definition of Flow Distance, measuring proximity between two flows is
not subject to the definition of each endpoint’s own region or the description of the combined
endpoints’ closeness. Instead, we capture the variation of all locations seamlessly and let the
flow data decide its own spatial neighbors. Accordingly, the distance between $F_1$
and $F_2$ can be calculated and compared directly: $FD_{12}$ equals $\sqrt{8}$ in both the Fig. 2a and
Fig. 2d scenarios.
Nevertheless, using only the location information of endpoints may sometimes be inadequate
because a flow not only represents the interaction or movement between two locations,
but also indicates how far and in what direction the interaction or movement happens. As
shown in Fig. 2e, the two flows have exactly the same endpoint distances as in Fig. 2a; therefore the
Flow Distances are the same according to equation (1). Regardless of the real data type they
represent, it would be controversial to say that the two flows in Fig. 2e are as close as the ones
in Fig. 2a given that they are separated much more, relative to their lengths. Controlling for the
impact of flow length may be necessary to avoid false positive detection of flow clusters. To
this end, we propose an extended version of Flow Distance that involves a rescaling, as provided
by equation (2). By dividing by the geometric mean of the two flow lengths, a flow pair with
longer average length would be measured as closer, ceteris paribus. Therefore, the distance
between the short flows $F_1$ and $F_2$ in Fig. 2e becomes four times as long as the one in Fig. 2a.
The rationale behind this adjustment is that under many circumstances it is more difficult or
rarer to witness spatial interaction or movement between two distant locations than between
close locations. For example, wild animals are more likely to travel to a nearby river than a distant
one to seek water. Incorporating flow length into the measure is one way to adjust the criterion
of clustering detection for flows with unequal lengths. Given that the adjustment would impair
some of the metric properties of distance, we name the adjusted Flow Distance the Flow Dissimilarity,
abbreviated as FDS in the rest of this article. We use the geometric mean rather than
the arithmetic mean of flow lengths because the former better attenuates the impact
of extremely unequal length values. In addition, it avoids the limit case of zero-length flows.
$$FDS_{ij} = \sqrt{\frac{\alpha\left[(x_i - x_j)^2 + (y_i - y_j)^2\right] + \beta\left[(u_i - u_j)^2 + (v_i - v_j)^2\right]}{L_i L_j}},$$
or
$$FDS_{ij} = \sqrt{\frac{\alpha d_O^2 + \beta d_D^2}{L_i L_j}}. \qquad (2)$$
where $FDS_{ij}$ denotes the Flow Dissimilarity between these two flows; $L_i$ and $L_j$ are the flow
lengths; the rest are the same as in equation (1).
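The rescaling in equation (2) can be sketched in the same way. The function name `flow_dissimilarity` and all coordinates below are hypothetical. Raising an error for a zero-length flow is a choice made for this sketch only, since the source does not say how such flows are handled in practice.

```python
import math


def flow_dissimilarity(fi, fj, alpha=1.0, beta=1.0):
    """Flow Dissimilarity of equation (2) between two flows given as (x, y, u, v)."""
    xi, yi, ui, vi = fi
    xj, yj, uj, vj = fj
    len_i = math.hypot(ui - xi, vi - yi)   # L_i
    len_j = math.hypot(uj - xj, vj - yj)   # L_j
    if len_i == 0 or len_j == 0:
        # Assumption of this sketch: refuse zero-length flows explicitly.
        raise ValueError("zero-length flow: equation (2) would divide by zero")
    d_o_sq = (xi - xj) ** 2 + (yi - yj) ** 2
    d_d_sq = (ui - uj) ** 2 + (vi - vj) ** 2
    return math.sqrt((alpha * d_o_sq + beta * d_d_sq) / (len_i * len_j))


# Two hypothetical pairs with identical endpoint separations (2 and 2),
# but flow lengths of 8 in the first pair and 2 in the second.
fds_long = flow_dissimilarity((0.0, 0.0, 0.0, 8.0), (2.0, 0.0, 2.0, 8.0))
fds_short = flow_dissimilarity((0.0, 0.0, 0.0, 2.0), (2.0, 0.0, 2.0, 2.0))
ratio = fds_short / fds_long   # lengths differ by a factor of 4, so does FDS
```

Because the endpoint separations are the same in both pairs, the ratio of the two dissimilarities equals the ratio of the geometric mean lengths, which is the kind of fourfold difference described for Fig. 2e relative to Fig. 2a.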
Although considering flow length in spatial pattern detection can be very useful and sometimes
necessary, we are not arguing that this is a better approach in all situations. Instead, we
believe that both measures make sense under certain circumstances. The literature offers evidence
of both choices: flow length was not discussed in some research (Berglund and Karlström 1999; Lu
and Thill 2003, 2008; Zhu and Guo 2014), while it was taken into consideration in other work
(Murray et al. 2011; Liu, Tong, and Liu 2015). In this research experiments have been
conducted with both Flow Distance (equation [1]) and Flow Dissimilarity (equation [2]) for
comparison, and details are provided in the case study section below.
Besides endpoint locations and flow length, the only remaining spatial element of a flow is
its directionality. Although we do not directly measure directionality in equations (1) and (2),
its impact is implicitly accounted for. As illustrated in Fig. 2f, to maintain $F_2$ at the same distance
from $F_1$, according to our Flow Dissimilarity equation it is sufficient to keep its origin
and destination at a constant distance from $F_1$’s two endpoints, that is, to keep its endpoints situated
on circles centered on $F_1$’s two endpoints (the dashed rings), for example, $F_2$. Given this
geometric constraint, there are in fact few degrees of freedom in directionality for flows that
exhibit a tendency toward clustering. Therefore we argue that it is not necessary to discuss flow
direction alone since it is heavily dependent on the endpoint locations and flow length. Our test
results also support this argument: the clusters identified consist of flows with similar directions.
Last but not least, the coefficients ($\alpha$, $\beta$) in the distance and dissimilarity functions are
designed to offer some flexibility in measuring real flow data. The basic functions by default
($\alpha = \beta = 1$) assign equal importance to the origin location and destination location of each
flow. However, the research objectives may lead us to pay closer attention to one set of end-
points over the other. For instance, in a study of settlement of foreign immigrants in New York
City in relation to national origin, socio-spatial patterns and processes would be better
informed if more weight is put on where immigrants choose to reside rather than where they
come from. As another example, the manager of a shopping center would be more interested in
where customers come from so that more targeted and effective advertising strategies can be
designed. The inconsistent spatial scale of flow origins and destinations may be another justifi-
cation to rebalance the relative importance of origins and destinations in the Flow Distance and
Dissimilarity measures. For example, different land uses are known to be spatially distributed
differently across cities; in particular employment sites tend to be more clustered geographi-
cally than residential land uses. Therefore, to avoid a statistical bias, a spatial analysis of com-
muting flows should control for the spatial distribution of potential flow origins and
destinations. With appropriate calibration, the same distance (e.g., 500 meters) would have the
same impact on describing the proximity between two origin locations or between two destina-
tion locations.
By adjusting the values of $\alpha$ and $\beta$, the Flow Distance or Dissimilarity can receive different
contributions from origins and destinations. For example, if we assign $\alpha = 1.5$ and $\beta = 0.5$,
the Flow Distance or Dissimilarity would be more sensitive to the change of origin locations
and the corresponding spatial pattern would put more weight on where flows start. In addition,
we impose the restriction $\alpha + \beta = 2$ to ensure that results obtained with different coefficients are
comparable. Both coefficients must also be positive so that the measure describes flows rather than points.
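A short worked example may help show the effect of the weights. The endpoint separations below are hypothetical and are not taken from the case study. Consider two flow pairs: pair A with $d_O = 1$ and $d_D = 3$, and pair B with $d_O = 3$ and $d_D = 1$. Under the default weights both pairs have the same Flow Distance, $\sqrt{1^2 + 3^2} = \sqrt{10} \approx 3.16$. With $\alpha = 1.5$ and $\beta = 0.5$, equation (1) gives:

```latex
\begin{align*}
\text{Pair A } (d_O = 1,\ d_D = 3): \quad
  & FD = \sqrt{1.5 \cdot 1^2 + 0.5 \cdot 3^2} = \sqrt{6} \approx 2.45 \\
\text{Pair B } (d_O = 3,\ d_D = 1): \quad
  & FD = \sqrt{1.5 \cdot 3^2 + 0.5 \cdot 1^2} = \sqrt{14} \approx 3.74
\end{align*}
```

The pair whose origins are close (pair A) is now measured as nearer than the pair whose destinations are close (pair B), which is the origin-oriented emphasis described above.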
Hot spot detection method
Using our Flow Distance (or Flow Dissimilarity) as the spatial proximity measure, it becomes
possible to apply well-developed distance-based methods to detect spatial clusters of flow data.
In this study we choose to adjust the local version of Ripley’s K-function. As a classical clus-
tering detection method, the K-function has been continuously implemented and enhanced
since it was redefined by Ripley in 1976 (Ripley 1976; Okabe, Boots, and Satoh 2007). The
fundamental idea of the K-function is to count the number of events within a certain distance
threshold of randomly selected event locations. This number is then used to calculate K-
function value after dividing by the event density and the analysis is repeated for other distan-
ces within a set interval. To obtain statistical conclusions, the K-function value needs to be
compared with the expected value given by the null hypothesis, for example Complete Spatial
Randomness (CSR). If the observed value is higher than expected, the study events exhibit a
tendency toward clustering; if it is lower, they tend toward dispersion. Monte Carlo simulation is a frequently
applied technique to assess statistical significance (Openshaw et al. 1987). One of the meaning-
ful extensions of K-functions was introduced by Getis and Franklin (1987), based on second-
order neighborhood analysis of mapped point patterns, which has been known as local K-
function analysis. An extension of the local K-function (equation [3]) is applied in this research
to flow data using the four-dimensional approach introduced above. Instead of counting point
events, flow events are counted within a certain Flow Distance (or Flow Dissimilarity) $r$ of
flow $F_i$ to represent the function value:
$$LocK_i(r) = E(\text{number of other flow events within } r \text{ of flow } i). \qquad (3)$$
where $LocK_i(r)$ is the local K-function value of flow $F_i$ at scale $r$. The scale $r$, also known as
the detection window radius or threshold distance, has always been a crucial factor in spatial
statistics, especially the K-function, which is even known as “multi-distance cluster analysis”.
In our approach we implement the local K-function at multiple scales as well. By increasing
the magnitude of scale $r$ within a certain range deemed suitable to the process under study, for
example, from 0.1 mile to 1 mile when using Flow Distance or from 0.1 to 1.0 when using
Flow Dissimilarity, it is convenient to detect multiscale clustering patterns at once.
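The multiscale counting behind equation (3) can be sketched as follows. This is a minimal illustration in Python rather than the implementation used in the study: the function name `local_k_counts`, the toy proximity matrix, and the choice to report raw counts without any density scaling are assumptions made for the example.

```python
from typing import List, Sequence


def local_k_counts(dist: Sequence[Sequence[float]],
                   scales: Sequence[float]) -> List[List[int]]:
    """For each flow i and each scale r, count the OTHER flows within r of flow i."""
    n = len(dist)
    counts = []
    for i in range(n):
        row = []
        for r in scales:
            # j != i excludes the flow itself, so only other flow events are tallied.
            row.append(sum(1 for j in range(n) if j != i and dist[i][j] <= r))
        counts.append(row)
    return counts


# Hypothetical symmetric proximity matrix for four flows (zero diagonal).
toy_dist = [
    [0.00, 0.15, 0.35, 0.85],
    [0.15, 0.00, 0.25, 0.75],
    [0.35, 0.25, 0.00, 0.65],
    [0.85, 0.75, 0.65, 0.00],
]
scales = [0.1 * t for t in range(1, 11)]  # r_t = r_1 * t with r_1 = 0.1
counts = local_k_counts(toy_dist, scales)
```

Each row of `counts` is the profile of one flow across the ten scales, so a single pass over the proximity matrix yields the clustering statistic at every scale.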
As with other spatial statistical methods, statistical inference is an important part of reach-
ing any conclusion. Given the nature of flow data, normal approximation is not an appropriate
null hypothesis (Lu and Thill 2003, 2008; Liu, Tong, and Liu 2015). Random permutations
with Monte Carlo simulation can better serve this purpose. In a two-dimensional space, there
is normally more than one way to simulate a set of flows. On the one hand, we can proceed
by setting the location of two endpoints for each simulated flow. Alternatively, we could use
observed flows as objects and move or rotate them in the study area according to some random-
ization procedure. Whatever the technique used, the theory or basic assumptions behind the
simulation must be fully spelled out.
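The two families of simulation described above can be sketched in Python as follows. Both functions are illustrative assumptions, not the procedure used in the study: `simulate_independent_endpoints` sets the two endpoints of each flow uniformly in a hypothetical rectangular study area, and `move_and_rotate` treats an observed flow as a rigid object that is translated and rotated.

```python
import math
import random


def simulate_independent_endpoints(n, bounds, rng):
    """Draw the origin and destination of each flow independently and uniformly."""
    xmin, ymin, xmax, ymax = bounds
    return [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax),
             rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))
            for _ in range(n)]


def move_and_rotate(flow, new_origin, angle):
    """Translate a flow (x, y, u, v) to a new origin and rotate it by `angle` radians."""
    x, y, u, v = flow
    dx, dy = u - x, v - y
    # Standard 2-D rotation of the origin-to-destination vector.
    rx = dx * math.cos(angle) - dy * math.sin(angle)
    ry = dx * math.sin(angle) + dy * math.cos(angle)
    nx, ny = new_origin
    return (nx, ny, nx + rx, ny + ry)


def flow_length(flow):
    x, y, u, v = flow
    return math.hypot(u - x, v - y)


rng = random.Random(0)
bounds = (0.0, 0.0, 10.0, 10.0)  # hypothetical rectangular study area
simulated = simulate_independent_endpoints(5, bounds, rng)

observed = (1.0, 1.0, 4.0, 5.0)  # a flow of length 5
moved = move_and_rotate(observed, (6.0, 2.0), rng.uniform(0.0, 2.0 * math.pi))
```

The first strategy randomizes location, length, and direction together, whereas the second preserves flow length. A rigidly moved flow may end outside the study area, so a practical procedure would need a rule for rejecting or redrawing such flows.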
The simplest way is to simulate two sets of points randomly and independently based on
Poisson distribution, and then pair and connect them as flows. However, the customary null
hypothesis for point data, that is, CSR, may not be the best option for flows. A more sensible
way is conditional spatial randomness, which has been used widely for computing the pseudo
P-value in spatial statistics (Anselin 1995). In terms of flow data, the “condition” should be
considered when the endpoints are restricted to the distribution of an at-risk population. For
instance, commuting flows can be simulated according to the residence and workplace
distributions (Lu and Thill 2003), and car accident points can be simulated on the road network
and adjusted by annual average daily traffic (Yamada and Thill 2010). In addition to endpoint
locations, the distribution of flow length and flow direction can also be conditional. Liu, Tong, and Liu (2015)
simulate a set of flows by moving one flow to another randomly selected flow’s endpoint
location, so that only the flows’ locations change while their lengths and directions are kept the
same. They also propose randomly pairing two points, one from the observed origins
and the other from the observed destinations, to form simulated flows. In contrast to the first
approach, this one keeps endpoint locations the same but reshuffles the lengths and directions.
In sum, there is no unique way to simulate spatial flows for significance testing. The
appropriate assumption depends on the data (e.g., restricting endpoints to an at-risk population). In
addition, it is up to the analyst to choose which aspect to examine (e.g., to examine the
contribution of flow location to the general flow clustering pattern by randomizing only location while
fixing direction and length). Fundamentally, cluster detection is an exploratory analysis. The
clusters identified can reflect the respective underlying geographical processes and can also
help us contemplate unknown governing attributes contributing to the spatial pattern. The detailed
algorithm is presented step by step as follows.
Algorithm implementation
1. Calculate Flow Proximity.
a. Prepare flow events as vectors with the coordinates of origin and destination points.
For example, flow $F_i$ with origin $O_i(x_i, y_i)$ and destination $D_i(u_i, v_i)$ is formatted
as $F_i(x_i, y_i, u_i, v_i)$.
b. Apply equation (1) or (2) to calculate the Flow Distance or Flow Dissimilarity
between every pair of flows. Thus an $N \times N$ distance matrix is computed for subsequent
use.
2. Calculate clustering detection statistics.
Calculate the local K-function using equation (3) for all the flow events using a series
of scales $r_t$ ($t = 1, 2, \ldots, 10$; $r_t = r_1 \times t$). The unit of $r_1$ depends on the proximity equation
used in the previous step, for example, $r_1 = 0.1$ mile with equation (1) or $r_1 = 0.1$
with equation (2).
3. Evaluate statistical significance.
a. Randomly simulate a set of $N$ flows in the study area.
b. Calculate the local K-function value for each simulated flow, as in steps (1) and (2).
c. Repeat the previous two steps 1,000 times.
d. Sort the results of the 1,000 simulations for each flow at each scale. Set the smallest
and largest ones as the lower and upper envelopes (0.1% significance level).
e. Compare the observed result with the corresponding significance envelopes. If the
observed value exceeds the upper envelope or falls below the lower envelope, the
observed pattern is said to be clustered or dispersed, respectively.
4. Visualize and discuss the results.
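Steps 1 to 3 can be sketched end to end in Python. The sketch rests on several assumptions that are not taken from this section: `flow_distance` uses a weighted four-dimensional Euclidean form as a stand-in for equation (1), the null model places endpoints uniformly in a hypothetical square instead of on a street network, only 99 simulations are run to keep the example fast, and the envelopes are indexed by flow position, which is one literal reading of step 3d.

```python
import math
import random


def flow_distance(f, g, alpha=1.0, beta=1.0):
    """ASSUMED form of equation (1): weighted Euclidean distance between two flows."""
    d_o2 = (f[0] - g[0]) ** 2 + (f[1] - g[1]) ** 2  # squared origin separation
    d_d2 = (f[2] - g[2]) ** 2 + (f[3] - g[3]) ** 2  # squared destination separation
    return math.sqrt(alpha * d_o2 + beta * d_d2)


def local_k(flows, scales):
    """Steps 1 and 2: N x N proximity matrix, then counts of other flows within each r."""
    n = len(flows)
    dist = [[flow_distance(flows[i], flows[j]) for j in range(n)] for i in range(n)]
    return [[sum(1 for j in range(n) if j != i and dist[i][j] <= r) for r in scales]
            for i in range(n)]


def random_flows(n, size, rng):
    """Step 3a under an assumed null: endpoints uniform in a size x size square."""
    return [tuple(rng.uniform(0.0, size) for _ in range(4)) for _ in range(n)]


def envelopes(n, scales, size, n_sim, rng):
    """Steps 3b to 3d: smallest and largest simulated value per flow index and scale."""
    lower = [[math.inf] * len(scales) for _ in range(n)]
    upper = [[-math.inf] * len(scales) for _ in range(n)]
    for _ in range(n_sim):
        sim = local_k(random_flows(n, size, rng), scales)
        for i in range(n):
            for t in range(len(scales)):
                lower[i][t] = min(lower[i][t], sim[i][t])
                upper[i][t] = max(upper[i][t], sim[i][t])
    return lower, upper


def classify(observed, lower, upper):
    """Step 3e: compare each observed value with its envelope."""
    labels = []
    for obs_row, low_row, up_row in zip(observed, lower, upper):
        row = []
        for obs, low, up in zip(obs_row, low_row, up_row):
            if obs > up:
                row.append("clustered")
            elif obs < low:
                row.append("dispersed")
            else:
                row.append("not significant")
        labels.append(row)
    return labels


rng = random.Random(42)
size = 10.0
scales = [0.1 * t for t in range(1, 11)]
# Six nearly identical planted flows followed by 20 random background flows.
planted = [(2.0 + 0.001 * k, 2.0, 7.0, 7.0 + 0.001 * k) for k in range(6)]
flows = planted + random_flows(20, size, rng)
observed = local_k(flows, scales)
lower, upper = envelopes(len(flows), scales, size, n_sim=99, rng=rng)
labels = classify(observed, lower, upper)
```

In this toy setting the six planted flows each have five neighbors at the smallest scale, far more than any simulated flow attains, so they are labeled as clustered. With the 1,000 simulations of step 3c, the same min and max envelopes correspond to the stated 0.1% level.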
Experimental study
Data description
In this study, we test the new flow K-Function method and its algorithmic implementation
using a data set of vehicle theft and recovery location pairs in Charlotte, North Carolina. Given
the determinate relationship and chronological order of the data, the locations where theft hap-
pened and the places where the vehicles were recovered can be regarded as flow origins and
destinations, respectively. According to the crime report released by the Charlotte-
Mecklenburg Police Department (CMPD), there were 14,064 vehicle theft cases within the city
from 09/01/2008 to 08/31/2014. Of all these cases, 6,960 have correct corresponding recovery
locations somewhere else in the city. In the data cleaning process, we excluded the records
with identical theft and recovery locations to remove cases of attempted break-ins, damage
to the vehicle, interrupted stealing, or other incomplete theft crimes. The final study data set
consists of 6,810 theft-recovery flow events. The map in Fig. 3 shows the
distribution of these locations. Overall, theft and recovery locations have similar distributions
across the city: both are concentrated around the city center, except for the southern
portion, which is known to encompass more affluent neighborhoods.
To gain a more intuitive knowledge of the data we also estimated the kernel density
(KDE) for both sets of locations with a cell size of 400 square feet and bandwidth of 0.5 mile
(Fig. 4). The KDE maps indicate that many car thefts happened in the eastern and northern
areas near the city center, while a significant part of them were recovered in the northwestern
region, where Charlotte Douglas International Airport is located. However, based on point pat-
tern analysis only, we can hardly build connections between theft locations and corresponding
recovery locations. According to popular criminological theories of vehicle theft crimes, such
as rational choice theory and routine activity theory, most criminals have meticulously
designed their target places and destination places in advance based on their cost-benefit analy-
ses (Lu 2006). As the new trend indicates, more vehicles are stolen by criminal gangs for
money-making business rather than joy-riding (McGoey 2000). Thus it would be extremely
useful to discover the spatial patterns of how stolen vehicles are transported from their offense
place to their destination.
Figure 3. (a) Vehicle theft locations in Charlotte. (b) Vehicle recovery locations in Charlotte.
Following the complete algorithm given in the previous section, we implement our flow
clustering detection approach on these crime data step by step. The null hypothesis of flow
distribution is that car thefts and recoveries can happen anywhere on the street network within the
Charlotte city limits. Therefore the 1,000-run Monte Carlo simulation proceeds by
randomly locating flows’ endpoints on the city’s street network. We choose this
assumption because we have little prior knowledge about motor vehicle theft crime that would justify
further restrictions on the distribution of car theft and recovery event locations, or on the flow lengths
and directions. Not imposing constraints on the spatial characteristics of flows in the simulation
process has the advantage of not excluding any possible contributions to the final cluster
results. Edge effects are corrected by reducing the analysis area by a distance equal to the
largest distance band used in the analysis (one mile in this case study). Only the flows with both
endpoints within this shrunken area are selected for computing the statistic, while the
background flow spatial process and the simulated flows remain within the original area. The
implementation program is written in C/C++, and the parallel computing technique OpenMP is
applied to accelerate computation, especially the simulation part. Results are visualized with
ArcMap 10.1 and jFlowMap (Boyandin, Bertini, and Lalanne 2010).
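The network-constrained null model and the edge correction described above can be illustrated with a short Python sketch. It is not the C/C++ program used in the study. The toy grid network, the rectangular study area, and the function names are assumptions, and the real analysis area is bounded by the city limits, not a rectangle.

```python
import math
import random


def sample_on_network(segments, n, rng):
    """Draw n points uniformly by length along straight street segments."""
    lengths = [math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in segments]
    # Longer streets are chosen proportionally more often.
    chosen = rng.choices(segments, weights=lengths, k=n)
    points = []
    for (x1, y1), (x2, y2) in chosen:
        t = rng.random()  # relative position along the chosen segment
        points.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return points


def simulate_network_flows(segments, n, rng):
    """Pair independently sampled origins and destinations into flows (x, y, u, v)."""
    origins = sample_on_network(segments, n, rng)
    destinations = sample_on_network(segments, n, rng)
    return [(ox, oy, dx, dy) for (ox, oy), (dx, dy) in zip(origins, destinations)]


def inside_shrunken_area(flow, bounds, margin):
    """Edge correction: keep a flow only if both endpoints lie `margin` inside the bounds."""
    xmin, ymin, xmax, ymax = bounds
    x, y, u, v = flow
    return all(xmin + margin <= px <= xmax - margin and
               ymin + margin <= py <= ymax - margin
               for px, py in ((x, y), (u, v)))


# Hypothetical toy network: three horizontal and three vertical streets.
grid = [0.0, 5.0, 10.0]
segments = ([((0.0, c), (10.0, c)) for c in grid] +
            [((c, 0.0), (c, 10.0)) for c in grid])
bounds = (0.0, 0.0, 10.0, 10.0)

rng = random.Random(1)
flows = simulate_network_flows(segments, 200, rng)
# Flows evaluated by the statistic; neighbors are still drawn from the full set.
core = [f for f in flows if inside_shrunken_area(f, bounds, margin=1.0)]
```

Only the flows in `core` would be tested for clustering, while their neighbor counts would still be computed against all flows in the original area, so that flows near the boundary are not penalized for unobserved neighbors.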
Results and discussion
Fig. 5 shows the local flow clusters detected with our method at selected scales (see Note 1). The flows on
the maps represent the local clusters detected by our new approach as significant at the 0.1%
level. Each flow has one end colored in red to denote the theft location and the other end in
green to show the recovery location. To avoid visual clutter, we aggregate nearby flow clusters
into the census block groups where their end points are situated.
Figure 4. (a) Kernel density estimation (KDE) of theft locations. (b) KDE of recovery locations.
The results are analyzed from two aspects. First, we compare the results obtained using the
same equation of flow proximity measure. The first three results use Flow Distance with scales
of different magnitudes, that is, 0.1, 0.2, and 0.3 of a mile. As the magnitude of the scale
increases, more flows are detected as local clusters. The same pattern can be found in the other
set of results using Flow Dissimilarity. The variance caused by scale magnitude is consistent
with the basic feature of the K-function that the spatial pattern is partly dependent upon the
size of the detection window. The increasing number of local flow clusters indicates that more
nearby flows are included to contribute to the local K-function value as the detection window
becomes larger. At the same time, the increase of scale does not have an equivalent impact on
the background distribution, which represents our null hypothesis. This is because we simulate the
background distribution by randomly placing the flow events on the street network without
further specific control, for example, crime risk; therefore the simulated flows are distributed
more sparsely throughout the city. As a result, the increase of scale has a positive impact on the
number of local flow clusters that are detected. As in other K-function related research, choos-
ing the optimal magnitude of scale remains an open question. It is typically selected in relation
to how the results can make sense to explain context-dependent research questions. In this
case, Fig. 5f presents some interesting patterns about vehicle theft and recovery flows. Vehicles
stolen from the southwestern section of the city are usually found far
away, and their transport directions vary considerably. In addition, there is another group of
clusters in the southeast showing much shorter transport distances and similar directions
toward the north. One possible reason is that for the vehicles stolen in the southwest there
are only a few “favorable” places nearby for criminals to dispose of them. Therefore these cars
are transported over a long distance to places like chop shops for selling or to places like the
airport. Routine criminals who steal from the southeast may find disposal much easier because
there are sites nearby to the north to dispose of the cars.
Figure 5. Detected flow clusters using different flow proximity measures. (a), (b), (c) use
Flow Distance (equation [1]) with detection scale equal to 0.1 mile, 0.2 mile, and 0.3 mile,
respectively. (d), (e), (f) use Flow Dissimilarity (equation [2]) with detection scale equal to
0.03, 0.04, and 0.05, respectively.
On the other hand, we can also compare the results using different types of flow proximity
measures, namely the Flow Distance and Flow Dissimilarity. Comparing the two series of
maps in the top and bottom parts of Fig. 5 for a similar number of local clusters, the most
obvious difference is the average length of clustered flows. The results using Flow Distance
contain many short flows, while the results using Flow Dissimilarity tend to indicate longer
flows as local clusters. Taking a closer look, we find that some flows—especially shorter
ones—within the same cluster identified using Flow Distance do not share many geographic
and geometric similarities with their neighboring flows, for example, quite different flow direc-
tions and flow lengths. In contrast, flows within the same cluster using Flow Dissimilarity tend
to be very similar to each other. The reason behind this difference is that, when flow length is
not considered in measuring flow proximity, short flows need not be as similar in endpoint
locations, length and direction to each other as longer ones to have the same flow distance.
Therefore, they are more readily detected as the locus of a significant cluster than long ones, all
other things being equal. This results in false positive detections, since some flows are detected as
local clusters simply because they are short enough to be captured by the detection window.
In contrast, local clusters identified with Flow Dissimilarity include flows with close
vehicle theft sites, close vehicle recovery sites, and similar movement directions and
distances. The pattern is consistent throughout the study region. Moreover, the results would be of
practical use to law enforcement agencies to detect routine gang-related crimes with locational
preference for stealing and selling or disposing of vehicles in the city. In conclusion, we argue
that the algorithm using Flow Dissimilarity to measure flow proximity is less likely to lead to
false positive errors, as it controls for one source of spurious cluster detection. In addition, it
provides a meaningful alternative to the traditional distance scale in solving the instability or
inequality in cross-scale flow clustering detection.
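A small numerical example makes the length effect concrete. Equations (1) and (2) are defined earlier in the article, so the forms below are assumptions for illustration only: Flow Distance is taken as the weighted Euclidean distance between origin pairs and destination pairs, and Flow Dissimilarity as that distance divided by the square root of the product of the two flow lengths. The four flows are hypothetical.

```python
import math


def flow_length(f):
    return math.hypot(f[2] - f[0], f[3] - f[1])


def flow_distance(f, g, alpha=1.0, beta=1.0):
    """ASSUMED form of equation (1)."""
    d_o2 = (f[0] - g[0]) ** 2 + (f[1] - g[1]) ** 2
    d_d2 = (f[2] - g[2]) ** 2 + (f[3] - g[3]) ** 2
    return math.sqrt(alpha * d_o2 + beta * d_d2)


def flow_dissimilarity(f, g, alpha=1.0, beta=1.0):
    """ASSUMED form of equation (2): Flow Distance scaled by the two flow lengths."""
    return flow_distance(f, g, alpha, beta) / math.sqrt(flow_length(f) * flow_length(g))


# Two short flows (0.1 long) running in OPPOSITE directions.
short_a = (0.0, 0.0, 0.1, 0.0)
short_b = (0.1, 0.0, 0.0, 0.0)
# Two long flows (5 long) that are PARALLEL and offset by 0.1.
long_a = (0.0, 0.0, 5.0, 0.0)
long_b = (0.0, 0.1, 5.0, 0.1)

fd_short = flow_distance(short_a, short_b)        # about 0.141
fd_long = flow_distance(long_a, long_b)           # about 0.141
fds_short = flow_dissimilarity(short_a, short_b)  # about 1.414
fds_long = flow_dissimilarity(long_a, long_b)     # about 0.028
```

Under these assumed forms the two pairs have the same Flow Distance even though the short flows point in opposite directions, which is the source of the false positives discussed above. Flow Dissimilarity separates them by a factor of 50, and only the parallel long pair would fall within a detection scale such as 0.03.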
So far we have only discussed experiments with the basic version of the flow proximity
measures. Further usefulness of the measures can be explored by changing their parameter values. In both
equations (1) and (2), we specify two coefficients, $\alpha$ and $\beta$, to control the relative
importance of origins and destinations. The expectation is that changing the relative value of these
coefficients can purposely create a tendency for alternative cluster detection results. To test this
hypothesis, we adjust our approach by changing the coefficient values in Flow Distance. We assign
$\alpha = 1.5$ and $\beta = 0.5$ for the first group and $\alpha = 0.5$ and $\beta = 1.5$ for the second. The sum of the
coefficient values is held at 2 for the sake of comparability of the results.
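The effect of the two coefficients can be checked on a pair of hypothetical flow pairs. As before, the weighted Euclidean form of `flow_distance` is an assumed stand-in for equation (1), with `alpha` weighting the origin separation and `beta` the destination separation.

```python
import math


def flow_distance(f, g, alpha=1.0, beta=1.0):
    """ASSUMED form of equation (1); alpha weights origins, beta weights destinations."""
    d_o2 = (f[0] - g[0]) ** 2 + (f[1] - g[1]) ** 2
    d_d2 = (f[2] - g[2]) ** 2 + (f[3] - g[3]) ** 2
    return math.sqrt(alpha * d_o2 + beta * d_d2)


# Pair P, Q: origins 0.1 apart, destinations far apart.
p, q = (0.0, 0.0, 3.0, 0.0), (0.1, 0.0, 0.0, 3.0)
# Pair R, S: mirror image, origins far apart, destinations 0.1 apart.
r, s = (3.0, 0.0, 0.0, 0.0), (0.0, 3.0, 0.1, 0.0)

default_pq = flow_distance(p, q)                     # about 4.24
default_rs = flow_distance(r, s)                     # about 4.24
origin_pq = flow_distance(p, q, alpha=1.5, beta=0.5)  # about 3.00
origin_rs = flow_distance(r, s, alpha=1.5, beta=0.5)  # about 5.20
dest_pq = flow_distance(p, q, alpha=0.5, beta=1.5)    # about 5.20
dest_rs = flow_distance(r, s, alpha=0.5, beta=1.5)    # about 3.00
```

With equal weights the two pairs are equally distant. Raising `alpha` discounts the destination separation, so the pair with close origins becomes the nearer one, and raising `beta` reverses the ordering. Keeping the sum of the coefficients at 2 leaves the default distance unchanged as a common reference.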
Fig. 6 includes two comparable result maps. Fig. 6a shows the clusters detected with
$\alpha = 1.5$ and $\beta = 0.5$, while Fig. 6b shows the outcomes with
$\alpha = 0.5$ and $\beta = 1.5$, both using the Flow Dissimilarity measure with a scale equal to 0.04.
Comparing these two maps with each other and with Fig. 5d, for which $\alpha = \beta = 1$ by default, we
find that Fig. 6a contains more unique clusters with very close theft locations (red end) but
relatively distant recovery locations (green end), while Fig. 6b tends to show the opposite pattern.
In other words, flows with close theft locations are readily detected as clusters in Fig. 6a and
flows with close recovery locations are favored in Fig. 6b. These observations are in line with
our premise that changing the value of Flow Distance coefficients can lead to results with dif-
ferent emphases, which can cater to people with different interests. In terms of practical useful-
ness, citizens would be more interested in looking at Fig. 6a which can inform where vehicle-
theft crimes are more likely to happen so that they can avoid parking in these highly risky pla-
ces. On the contrary, police would find Fig. 6b more useful in order to know where the concen-
trations of car-disposal places are and where they should search for the lost vehicles. By
comparing the result maps with Google Maps we found that the neighborhoods surrounding the
main campus of UNC Charlotte correspond to the cluster of theft sites in the northeastern part
of Fig. 6a, which indicates that this area is a popular car theft locus. Some clusters of recovery
places near the city center in Fig. 6b match the locations of salvage vehicle yards or chop shops,
where stolen cars can be quickly transacted with cash and be sold again in parts.
Figure 6. Flow clusters with different endpoint emphases. (a) Clusters more focused on theft
locations ($\alpha = 1.5$, $\beta = 0.5$). (b) Clusters more focused on recovery locations ($\alpha = 0.5$, $\beta = 1.5$).
Conclusions
Spatial statistical approaches to clustering detection have been continuously developed for
decades. In contrast with the abundant methods designed for point and polygon data, approaches well
suited to handling spatial flow data have not been well developed so far. To fill this gap and
also to meet the challenges brought by the emerging breadth of massive flow data, this research
has developed an innovative spatial statistical method for flows. A pair of spatial
proximity measures called the Flow Distance and Flow Dissimilarity have been designed.
Based on these measures the local version of the K-function is adjusted and implemented to
examine the second-order effects of spatial flows. By comparing the observed local K-function
value with the statistical confidence envelopes generated via Monte Carlo simulation, the local
clustering pattern of each flow event can be identified at a certain statistical significance level.
The new method is an intuitive extension of the principles embedded in the K-function for
one-dimensional point events and is applicable to all types of flow data.
To test the effectiveness and usefulness of our method, a series of experiments have been
implemented using a real data set of vehicle theft-recovery flows in Charlotte, NC. The results
demonstrate that our method is capable of identifying local clusters from the several thousands of
tangled flows. Specifically, the measures we designed proved not only to be measures of spatial
proximity, but an effective solution for the inclusion of the multilocation interaction objects
within the scope of well-developed point pattern spatial statistics, namely the local K-function.
By adjusting the parameters of endpoint coordinate pairs, the study emphasis can be purposely
placed on the spatial associations between either flow origins or flow destinations. In addition,
the impact of flow length has also been thoroughly discussed. To overcome the statistical bias
brought by flow lengths, we introduced a variant of Flow Distance called Flow Dissimilarity.
The experiment shows that the algorithm using Flow Dissimilarity leads to more stable spatial
patterns and is adaptive to flows with varied lengths across the study region. Overall, the method
designed in this research has fully utilized the spatial characteristics of flow data, and it is demon-
strated to be capable of investigating spatial associations of flow events across scales. The results
examined with this method have practical implications as well. In this vehicle-theft crime exam-
ple, it can inform not only where frequent car theft and recovery happen, but how the stolen cars
are moved from one place to another in the form of spatial flow clusters. The results are espe-
cially useful to devise effective police responses to routine gang crime activities.
The proposed analytic method can be extended in several ways. First, further work can be
done to expand the capability of this method to include additional event characteristics, for
example considering flow type and value in “hot flow” detection. A plausible idea is to use the
local cross K-function (Boots and Okabe 2007) instead of the traditional local K-function to
detect clusters of flows with different types, for example, rescue goods flow spatially associated
with refugee flow; and to accumulate the total value of nearby flows instead of simply tallying
their frequency in calculating the local K-function so as to adjust the contribution of flows with
unequal value, for example, a one-thousand-people commuting flow versus a single-person
commuting flow. Also, we believe that the Flow Distance and Flow Dissimilarity measures can
be shown to be effective with other methods of exploratory spatial data analysis including the
local Moran’s I and G statistics for flow data analysis. Furthermore, we envision that the princi-
ples of the flow proximity measure can be further expanded to higher dimensionality for the
space-time analysis of flow data, or to other kinds of spatial analyses, for example spatial inter-
action modeling and trajectory data analysis. Lastly, combining this spatial statistical method
with other fast-developing techniques is also very meaningful. GeoComputation, GeoVisuali-
zation, and spatial data mining are all powerful methods that complement confirmatory statisti-
cal analysis, especially in this “Big Data” era.
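The value-weighted extension suggested above, accumulating the total value of nearby flows instead of tallying their frequency, could look like the following Python sketch. The function name, the toy proximity matrix, and the flow values are hypothetical, and the sketch is not part of the method as tested in this study.

```python
def value_weighted_local_k(dist, values, r):
    """Sum the values of the other flows within r of each flow, instead of counting them."""
    n = len(dist)
    return [sum(values[j] for j in range(n) if j != i and dist[i][j] <= r)
            for i in range(n)]


# Hypothetical proximity matrix for three flows.
toy_dist = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
# Flow 0 carries 1,000 commuters; flows 1 and 2 carry one commuter each.
values = [1000, 1, 1]

weighted = value_weighted_local_k(toy_dist, values, r=0.5)
counts = value_weighted_local_k(toy_dist, [1, 1, 1], r=0.5)  # plain tally for comparison
```

Here flows 0 and 1 each have one neighbor within the window, so the plain tally treats them alike, while the weighted version credits flow 1 with the 1,000 commuters carried by its neighbor.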
Note
1 The observed global K-function for this dataset is above the 0.01 upper envelope at most scales. To bet-
ter demonstrate the capability of our new local flow clustering statistics, we report results for selected
scales within the range of statistical significance.
References
Aldstadt, J., and A. Getis. (2006). “Using AMOEBA to Create a Spatial Weights Matrix and Identify Spa-
tial Clusters.” Geographical Analysis 38(4), 327–43.
Anselin, L. (1995). “Local Indicators of Spatial Association–LISA.” Geographical Analysis 27(2),
93–115.
Anselin, L., I. Syabri, and Y. Kho. (2006). “GeoDa: An Introduction to Spatial Data Analysis.”
Geographical Analysis 38(1), 5–22.
Berglund, S., and A. Karlström. (1999). “Identifying Local Spatial Association in Flow Data.” Journal of
Geographical Systems 1(3), 219–36.
Besag, J., and J. Newell. (1991). “The Detection of Clusters in Rare Diseases.” Journal of the Royal Sta-
tistical Society Series A 154(1), 143–55.
Boots, B., and A. Okabe. (2007). “Local Statistical Spatial Analysis: Inventory and Prospect.”
International Journal of Geographical Information Science 21(4), 355–75.
Boyandin, I., E. Bertini, and D. Lalanne. (2010). “Using Flow Maps to Explore Migrations over Time.” In
Geospatial Visual Analytics Workshop in Conjunction with The 13th AGILE International
Conference on Geographic Information Science. Guimarães, Portugal, 2(3).
Chen, J., R. Wang, L. Liu, and J. Song. (2011). “Clustering of Trajectories Based on Hausdorff Distance.”
2011 International Conference on Electronics, Communications and Control (ICECC), Ningbo,
China, 1940–44.
Cressie, N. (1993). Statistics for Spatial Data. New York: Wiley.
Cui, W., H. Zhou, H. Qu, P. C. Wong, and X. Li. (2008). “Geometry-Based Edge Clustering for Graph
Visualization.” IEEE Transactions on Visualization and Computer Graphics 14(6), 1277–84.
Diggle, P. (1983). Statistical Analysis of Spatial Point Patterns. London: Academic Press.
Fortin, M., and M. Dale. (2009). “Spatial Autocorrelation.” In The SAGE Handbook of Spatial Analysis,
89–103, edited by S. Fotheringham and P. Rogerson. London: Sage.
Fotheringham, S. (1997). “Trends in Quantitative Methods I: Stressing the Local.” Progress in Human
Geography 21(1), 88–96.
Fotheringham, S., and B. Zhan. (1996). “A Comparison of Three Exploratory Methods for Cluster Detec-
tion in Spatial Point Patterns.” Geographical Analysis 28(3), 200–18.
Geary, R. (1954). “The Contiguity Ratio and Statistical Mapping.” The Incorporated Statistician
5(3), 115–45.
Genolini, C., and B. Falissard. (2010). “KmL: K-Means for Longitudinal Data.” Computational Statistics
25(2), 317–28.
Getis, A., and J. Franklin. (1987). “Second-Order Neighborhood Analysis of Mapped Point Patterns.”
Ecology 68, 473–77.
Getis, A., and J. Ord. (1992). “The Analysis of Spatial Association by Use of Distance Statistics.” Geo-
graphical Analysis 24(3), 189–206.
Guo, D. (2009). “Flow Mapping and Multivariate Visualization of Large Spatial Interaction Data.” IEEE
Transactions on Visualization and Computer Graphics 15(6), 1041–48.
Kulldorff, M. (1997). “A Spatial Scan Statistic.” Communications in Statistics - Theory and Methods
26(6), 1481–96.
Lee, J. G., J. Han, and K. Y. Whang. (2007). “Trajectory Clustering: A Partition-and-Group Framework.” In
Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. Beijing,
China, 593–604.
Liu, Y., D. Tong, and X. Liu. (2015). “Measuring Spatial Autocorrelation of Vectors.” Geographical
Analysis 47(3), 300–19.
Lu, Y. (2006). “Spatial Choice of Auto Thefts in an Urban Environment.” Security Journal 19(3),
143–66.
Lu, Y., and J.-C. Thill. (2003). “Assessing the Cluster Correspondence between Paired Point Locations.”
Geographical Analysis 35(4), 290–309.
Lu, Y., and J.-C. Thill. (2008). “Cross-scale Analysis of Cluster Correspondence Using Different Opera-
tional Neighborhoods.” Journal of Geographical Systems 10(3), 241–61.
McGoey, C. (2000). “Auto Theft Facts.” www.crimedoctor.com/autotheft1.htm
Moran, P. (1950). “Notes on Continuous Stochastic Phenomena.” Biometrika 37(1), 17–23.
Murray, A., Y. Liu, S. J. Rey, and L. Anselin. (2011). “Exploring Movement Object Patterns.” The Annals
of Regional Science 49(2), 471–84.
Nanni, M., and D. Pedreschi. (2006). “Time-Focused Clustering of Trajectories of Moving Objects.”
Journal of Intelligent Information Systems 27(3), 267–89.
Okabe, A., B. Boots, and T. Satoh. (2010). “A Class of Local and Global K-functions and Their Exact
Statistical Methods.” In Perspectives on Spatial Data Analysis, 101–12, edited by L. Anselin and S. J. Rey.
Berlin, Heidelberg: Springer.
Openshaw, S., M. Charlton, C. Wymer, and A. Craft. (1987). “A Mark 1 Geographical Analysis Machine
for the Automated Analysis of Point Data Sets.” International Journal of Geographical Information
Systems 1(4), 335–58.
Ord, J., and A. Getis. (1995). “Local Spatial Autocorrelation Statistics: Distributional Issues and an
Application.” Geographical Analysis 27(4), 286–306.
Ossama, O., H. Mokhtar, and M. El-Sharkawi. (2011). “Clustering Moving Objects Using Segments
Slopes.” International Journal of Database Management Systems 3(1), 35–48.
Ripley, B. D. (1976). “The Second-Order Analysis of Stationary Point Processes.” Journal of Applied
Probability 13, 255–66.
Symanzik, J. (2014). “Exploratory Spatial Data Analysis.” In Handbook of Regional Science, 1295–310,
edited by M. M. Fischer and P. Nijkamp. Heidelberg, Germany: Springer.
Tobler, W. R. (1987). “Experiments in Migration Mapping by Computer.” The American Cartographer
14, 155–63.
Waller, L. (2009). “Detection of Clustering in Spatial Data.” In The SAGE Handbook of Spatial Analysis,
159–81, edited by S. Fotheringham and P. Rogerson. London: Sage.
Yamada, I., and J.-C. Thill. (2007). “Local Indicators of Network-Constrained Clusters in Spatial Point
Patterns.” Geographical Analysis 39(3), 268–92.
Yamada, I., and J.-C. Thill. (2010). “Local Indicators of Network-Constrained Clusters in Spatial Patterns
Represented by a Link Attribute.” Annals of the Association of American Geographers 100(2),
269–85.
Zhu, X., and D. Guo. (2014). “Mapping Large Spatial Flow Data with Hierarchical Clustering.”
Transactions in GIS 18(3), 421–35.
Geographical Analysis
18
... Relatedness refers to the spatial proximity of origins or destinations, while heterogeneity denotes the randomness of the flow's origins and destinations. Metrics for assessing flow relatedness include Moran's I statistic for flows (Liu et al. 2015); and flow heterogeneity is indicated by metrics, such as the K-function (Tao and Thill 2016), local K-function (Berglund and Karlstrom 1999), and L-function (Shu et al. 2021). Both lines of research are intimately associated with cluster analysis of geographic flows. ...
... Existing flow-clustering metrics are primarily divided into the following categories. The first and most commonly used flow-distance measure was proposed by Tao and Thill (2016). Here, the flow is defined as a four-dimensional object expressed as (x i , y i , u i and v i ), where x i and y i represent the latitude and longitude of point O and u i and v i represent the latitude and longitude of point D. The distance between two flows was then calculated (formula (1)), with a and b as constants (a > 0, b > 0, a þ b ¼ 2; by default, Figure 2. Workflow of the spatio-temporal pattern analysis of Mobike data. ...
... By contrast, the geographic flow spectral clustering method proposed herein, which is grounded in graph theory and matrix eigenvalues, concentrates on the decomposition and clustering of the overall data structure, making comparisons with these methods inappropriate. Consequently, flowHDBSCAN (Tao et al. 2017), OPTICS (Fang et al. 2021) and the flow dissimilarity clustering method proposed by Tao and Thill (2016) were selected for comparison with the proposed geographic flow spectral clustering method. As illustrated by the red box in Figure 15(b), the optimal cluster count for the 'flow_Dissimilarity' method was six, whereas for flowHDBSCAN, as determined in Section 4.1, it was 57. ...
Article
Full-text available
Geographic flow clustering analysis can effectively reveal human behavioral patterns in movement. Traditional methods for studying human movement patterns are mostly based on first-order quantity analyses of point data, such as hotspots, density or clustering. Currently, relatively few second-order spatial analysis methods based on geographic flows exist. Thus, we developed a new geographic flow method based on spectral clustering and applied it to trajectory data analysis. This article uses the bike-sharing trajectories data in Shanghai in August 2016, spectral clustering analysis was conducted on the group flow patterns before, during and after rainfall, on weekdays and weekends and in the morning and evening peak. Spectral clustering was verified to exhibit better clustering effect by comparing the clustering indices of different clustering methods. This study enriches the analysis method of geographical flows, and the human mobility patterns revealed by its analysis can provide references for formulating urban green travel policies.
... This aggregation approach is valid in reducing the flow cluttering problem, but it ignores the flow patterns at local scales (Zhu and Guo 2014;Zhu et al. 2019). In recent years, several flow clustering methods have been developed in an attempt to extract flow clusters from large flow data Gao et al. 2018;Tao and Thill 2016). These clustering methods mitigate the cluttering and overlapping issues by extracting clusters of similar trips, while maximizing the spatial resolution of the data (Song et al. 2019;Zhu and Guo 2014). ...
... For instance, Liu et al. (2015) improved the global and local Moran's I statistics to extract flow clusters containing highly spatially correlated trips, and conducted an empirical study using taxi data from Shanghai, China as a case study. Tao and Thill (2016) proposed a K-function extension method for OD flow data to upgrade its detection target from point clusters to flow clusters. In addition, Gao et al. (2018) introduced a multidimensional spatial scan statistics approach to identify flow clusters. ...
... Given that some density-based clustering algorithms (e.g., DBSCAN (Ester et al. 1996) and OPTICS (Ankerst et al. 1999)) are well able to identify irregularly shaped point clusters, researchers have successfully upgraded point clustering to flow clustering by improving these traditional algorithms (Gallego et al. 2018; Tao and Thill 2016). Although such density-based clustering approaches have competitive advantages in detecting arbitrarily shaped clusters, they exhibit poor clustering performance when the density of OD flows is unevenly distributed (Reddy and Bindu 2017). ...
Article
Extracting flow clusters consisting of many similar origin–destination (OD) trips is essential to uncover the spatio-temporal interactions and mobility patterns in the free-floating bike sharing (FFBS) system. However, due to occlusion and display clutter issues, efforts to identify inhomogeneous flow clusters from large journey data have been hampered to some extent. In this study, we present a two-stage flow clustering method, which integrates the Leiden community detection algorithm and the shared nearest-neighbor-based flow (SNN_flow) clustering method to efficiently identify flow clusters with arbitrary shapes and uneven densities. The applicability and performance of the method in detecting flow clusters are investigated empirically using the FFBS system of Nanjing, China as a case study. Some interesting findings can be drawn from the spatio-temporal patterns. For instance, the share of flow clusters used to meet the “first-/last-mile” demand at metro stations is reasonably high, both during the morning (71.85%) and evening (65.79%) peaks. Compared with the “first-/last-mile” flow clusters between metro stations and adjacent workplaces, the solution of the “first-/last-mile” flow clusters between metro stations and adjacent residences is more dependent on the FFBS system. In addition, we explored the shape and density distribution of flow clusters from the perspective of origin and destination points. The endpoint distribution characteristics demonstrate that the shape distribution of metro station point clusters is generally flatter and the spatial points within them are more concentrated than other sorts of point clusters. Our findings could help to better understand human movement patterns and home-work commute, thereby providing more rational and targeted decisions for allocating FFBS infrastructure resources.
... With respect to discrete flows, a flow measurement was proposed to evaluate the global and local spatial autocorrelation of flows, which considers the flow direction and magnitude (Liu, Tong, and Liu, 2015). Ripley's K-functions (Ripley, 1976) for flows were developed to detect spatial clusters in planar space (Tao and Thill, 2016) and network space (Kan, Kwan, and Tang, 2021), and the flow cross K-function was proposed to evaluate the spatial dependence of two types of discrete flows (Tao and Thill, 2019). The L-function (normalization of the K-function) in flow space was proposed to reveal the maximal aggregation scale of flows (Shu et al., 2020), and the length-squared L-function was developed to identify the clustering patterns of flows in network space . ...
... For spatial distance measurement, both the spatial relationships between flow origins and between flow destinations should be considered (Yan et al., 2023). Referring to the method of combining the origin and destination distances for discrete flows (Tao and Thill, 2016; Shu et al., 2020), the additive distance and maximum distance for aggregated flows based on topological or Euclidean distances can be defined. Equations (2) and (3) use the additive distance as an example. ...
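The additive and maximum distances mentioned in this excerpt can be illustrated with a minimal sketch. It assumes Euclidean distances between point origins and destinations; the function names and the `(origin, destination)` tuple layout are hypothetical and are not taken from any of the cited papers.

```python
import math

def additive_flow_distance(f, g):
    """Sum of the origin-to-origin and destination-to-destination distances."""
    (f_o, f_d), (g_o, g_d) = f, g
    return math.dist(f_o, g_o) + math.dist(f_d, g_d)

def maximum_flow_distance(f, g):
    """Larger of the origin-to-origin and destination-to-destination distances."""
    (f_o, f_d), (g_o, g_d) = f, g
    return max(math.dist(f_o, g_o), math.dist(f_d, g_d))

# Each flow is an (origin, destination) pair of planar coordinates.
f_a = ((0.0, 0.0), (10.0, 0.0))
f_b = ((3.0, 4.0), (10.0, 1.0))

print(additive_flow_distance(f_a, f_b))  # origins are 5 apart, destinations 1 apart
print(maximum_flow_distance(f_a, f_b))
```

Both measures compare whole flows instead of treating origins and destinations as unrelated points. The additive form penalizes a mismatch at either end, while the maximum form is driven by the worse of the two ends.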
Article
Flows can reflect the spatiotemporal interactions or movements of geographical objects between different locations. Measuring the spatiotemporal autocorrelation of flows can help determine the overall spatiotemporal trends and local patterns. However, quantitative indicators of flows used to measure spatiotemporal autocorrelation both globally and locally are still rare. Therefore, we propose the global and local flow spatiotemporal Moran's I (FSTI). The global FSTI is used to assess the overall spatiotemporal autocorrelation degree of flows, and the local FSTI is applied to identify local spatiotemporal clusters and outliers. In the FSTI, to reflect flow spatiotemporal adjacency relationships, we establish flow spatiotemporal weights by multiplying the spatial and temporal weights of flows considering spatiotemporal orthogonality. The flow spatial weights include contiguity‐based (considering first/higher‐order and common border) and Euclidean distance‐based weights. The temporal weights consider ordinary and lagged cases. As flow attributes may follow a long‐tail distribution, we conduct Monte Carlo simulations to evaluate the statistical significance of the results. We assess the FSTI using synthetic datasets and Chinese population mobility datasets, and compare some results with those of recent flow‐related methods. Additionally, we perform a sensitivity analysis to select a suitable temporal threshold. The results show that the FSTI can be used to effectively detect spatiotemporal variations in the autocorrelation degree and type.
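The idea of building flow spatiotemporal weights as the product of a spatial weight and a temporal weight can be sketched as follows. This is an illustration under stated assumptions, not the FSTI authors' implementation: the spatial weight is a binary threshold on an additive Euclidean flow distance, the temporal weight is a binary threshold on the time difference, and the names `d_max`, `tau`, and the dictionary keys are hypothetical.

```python
import math

def spatial_weight(f, g, d_max):
    """1.0 if the additive origin/destination distance is within d_max (assumed rule)."""
    d = math.dist(f["o"], g["o"]) + math.dist(f["d"], g["d"])
    return 1.0 if d <= d_max else 0.0

def temporal_weight(f, g, tau):
    """1.0 if the two flows occur within tau time units of each other (assumed rule)."""
    return 1.0 if abs(f["t"] - g["t"]) <= tau else 0.0

def spacetime_weights(flows, d_max, tau):
    """Pairwise weights: product of spatial and temporal weights, zero on the diagonal."""
    n = len(flows)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w[i][j] = (spatial_weight(flows[i], flows[j], d_max)
                           * temporal_weight(flows[i], flows[j], tau))
    return w

flows = [
    {"o": (0, 0), "d": (5, 5), "t": 0},
    {"o": (1, 0), "d": (5, 6), "t": 1},   # close to the first flow in space and time
    {"o": (0, 1), "d": (6, 5), "t": 10},  # close in space, far in time
]
W = spacetime_weights(flows, d_max=3.0, tau=2)
for row in W:
    print(row)
```

Because the weights are multiplied, two flows count as neighbors only when they are adjacent in both space and time, which is why the third flow receives zero weight here despite being spatially close to the others.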
... Such ESDA methods take a bottom-up strategy to discover interesting patterns from (usually big) flow datasets; they may lead to the formulation of hypotheses based on observations, and eventually may contribute to the development of new theories. Existing methods mainly focus on spatial properties of flows, such as flow distribution patterns detection (Cai & Kwan, 2022; Kan, Kwan, & Tang, 2021; Shu et al., 2021; Tao & Thill, 2016a; Tao & Thill, 2019a); density-based flow clustering (Liu, Yang, Deng, Song, & Liu, 2022; Shu et al., 2022; Tao & Thill, 2016b; Zhu, Guo, Koylu, & Chen, 2019); hierarchical-based flow clustering (Tao, Thill, Depken, & Kashiha, 2017; Zhu & Guo, 2014); statistical-based flow clustering (Liu, Yang, Deng, Liu, & Xu, 2022; Song et al., 2019; Tao & Thill, 2019b). Given the increasing availability of flow data with fine spatiotemporal resolution and high attribute dimensionality, data-driven and computationally powerful ESDA methods can be expected to feature prominently in the toolbox for the exploration of big spatial flow data in various contexts. ...
... For instance, f b1 can be considered more distant from f a than f b2 or f b3 , thus being assigned a smaller spatial weight. Besides, w ij,uv can be calculated by other non-contiguity-based flow neighborhood definitions, e.g., being inversely proportional to the flow distance (Tao & Thill, 2016a). Unlike the original LISA based on Moran's I, its spatial flow extension cannot infer statistical significance based on presumed data distribution. ...
Article
Spatial flow data represent meaningful spatial interaction (SI) phenomena between geographic regions that are often highly dynamic. However, most existing flow analytical methods are cross-sectional, and there is a lack of methods to measure spatiotemporal autocorrelation of flow data. To fill this gap, we proposed a new localized spatial statistical method called Space-Time Flow LISA. The method design is a combination of two existing method families, namely space-time LISA and Spatial Flow LISA. A critical component of the method is the space-time weight matrix of flow data that blends pairwise spatial and temporal connectivities. We design three versions of the matrix, namely contemporaneous, lagged, and hybrid. We evaluate the method using both synthetic data and a case study of U.S. interstate migration from 2005 to 2017. The method is found to have high efficacy in finding spatiotemporal local autocorrelation patterns. Unlike the Spatial Flow LISA that tends to detect short-distance migration corridor havens (‘HH’ flows) and long-distance migration corridor deserts (‘LL’ flows), the Space-Time Flow LISA is less impeded by the distance between flow origin and destination, as it can pick up local patterns that are less spatially explicit but temporally dependent. In addition, the new method is able to detect time-sensitive patterns such as the outmigration from Louisiana forced by Hurricane Katrina in 2005. By integrating spatial, temporal, and attributive associations into a one-step analysis, the proposed Space-Time Flow LISA can illustrate the spatiotemporal structure of flow phenomena well, and reveal dynamic distribution changes over time.
... However, their flow unit merging method was flawed, compromising result accuracy. Tao and Thill (2016) used spatial statistical methods to construct an empirical spatial flow weight matrix, identifying anomalous interaction regions as clusters of very high or low flow values (Tao and Thill 2016). Although still relying on flow unit counts, their key contribution was achieving significant flow patterns through spatial statistics. ...
Article
One of the most crucial topics in spatial interaction studies is mining patterns from extensive origin-destination (OD) flow data to capture interregional associations. However, prevailing methodologies tend to disregard the importance of using the relative closeness of interregional connections as weights, treat spatial and temporal dimensions independently, or overlook the temporal dimension completely. Consequently, the identified patterns are susceptible to inaccuracies, and the precise identification of pattern occurrence time and duration, despite their fundamental importance, remains elusive. In light of these challenges, this study proposes a strategy to calculate and combine the strength of weighted spatiotemporal flows, and develops a clustering method and evaluation metrics based on this framework. Compared to alternative density-based methods, the strength-based calculation approach demonstrates a capacity to identify flow patterns characterized by relatively high interregional closeness. Thus, the identification of flow patterns expands beyond density-based approaches, encompassing strength-based considerations and a shift from absolute to relative closeness between regions. Experiments using synthetic datasets conducted in this research demonstrate the effectiveness, efficiency, and extraction accuracy of the proposed method. Furthermore, a case study using real Chinese population migration data demonstrates the efficacy of the method in revealing implicit spatiotemporal association patterns between regions. The present study implements an interaction strength-based flow clustering and evaluation method that considers spatiotemporal continuity, making it applicable to spatial flow data analysis involving interaction volume and time attributes. As a result, this method holds promise for facilitating the modeling of intricate spatial flows within various contexts of study.
... Moran's I statistic (Moran 1950) was extended to measure the spatial autocorrelation of spatial flows (Liu et al. 2015), and Anselin's local indicators of spatial association (LISA) (Anselin 1995) was extended to quantify the local spatial autocorrelation of spatial flows (Liu et al. 2015). The flow K-function was introduced to detect spatial flow clusters by extending the classic K-function to the spatial flow context (Tao and Thill 2016). The flow K-function can only be applied with discrete flows. ...
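The core counting step behind a K-function-style statistic for discrete flows can be sketched as below: for each flow, count the other flows that lie within a distance `r` of it, where distance is measured between entire flows. This is only a simplified illustration. The additive flow distance, the function names, and the absence of any scaling, edge correction, or significance test are assumptions and do not reproduce the estimator of Tao and Thill (2016).

```python
import math

def flow_distance(f, g):
    """Assumed flow-to-flow distance: origin distance plus destination distance."""
    (f_o, f_d), (g_o, g_d) = f, g
    return math.dist(f_o, g_o) + math.dist(f_d, g_d)

def local_neighbor_counts(flows, r):
    """For each flow, the number of other flows within flow distance r."""
    counts = []
    for i, f in enumerate(flows):
        n = sum(1 for j, g in enumerate(flows)
                if j != i and flow_distance(f, g) <= r)
        counts.append(n)
    return counts

flows = [
    ((0, 0), (10, 10)),
    ((1, 0), (10, 11)),
    ((0, 1), (11, 10)),
    ((50, 50), (0, 0)),   # an isolated flow
]
counts = local_neighbor_counts(flows, r=3.0)
print(counts)
```

Flows with unusually high counts at a given `r` are candidates for "hot flows"; repeating the count over several values of `r` gives the multiscale view that K-function methods are known for.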
Article
Spatial flows represent spatial interactions or movements. Mining colocation patterns of different types of flows may uncover the spatial dependences and associations among flows. Previous studies proposed a flow colocation pattern mining method and established a significance test under the null hypothesis of independence for the results. In fact, the definition of the null hypothesis is crucial in significance testing. Choosing an inappropriate null hypothesis may lead to misunderstandings about the spatial interactions between flows. In practice, the overall distribution patterns of different types of flows may be clustered. In these cases, the null hypothesis of independence will result in unconvincing results. Thus, considering the overall spatial pattern of flows, in this study, we changed the null hypothesis to random labeling to establish the statistical significance of flow colocation patterns. Furthermore, we compared and analyzed the impacts of different null hypotheses on flow colocation pattern mining through synthetic data tests with different preset patterns and situations. Additionally, we used empirical data from ride-hailing trips to show the practicality of the method.
... (3) Flow cluster analysis: Flow clustering analysis of unsupervised learning was conducted according to the marked coordinate point information of different beekeepers' migratory flows. Considering that streams can be represented by directional line segments, three principles were considered when measuring the spatial similarity between flows [23]: (a) flows are spatially close to each other, (b) flow directions are approximately equal, and (c) the flow lengths are similar. Referring to the method introduced by Tao and Thill [24], which uses spatial statistical methods to detect clustering in flow data, a schematic diagram of the spatial position between two beekeeper-to-ground flows is shown in Figure 1. ...
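The three principles listed in this excerpt (spatial closeness, similar direction, similar length) can each be turned into a simple measure when flows are treated as directed line segments. The sketch below is one possible formulation; the specific measures (midpoint distance, angle between vectors, length ratio) and all function names are assumptions, not the definitions used in the cited study.

```python
import math

def flow_vector(f):
    """Displacement vector from origin to destination."""
    (ox, oy), (dx, dy) = f
    return dx - ox, dy - oy

def midpoint_distance(f, g):
    """(a) Closeness: distance between the midpoints of the two segments."""
    mid = lambda h: ((h[0][0] + h[1][0]) / 2, (h[0][1] + h[1][1]) / 2)
    return math.dist(mid(f), mid(g))

def direction_difference(f, g):
    """(b) Direction: unsigned angle between the two vectors, in degrees (0 to 180)."""
    ux, uy = flow_vector(f)
    vx, vy = flow_vector(g)
    dot = ux * vx + uy * vy
    cross = ux * vy - uy * vx
    return abs(math.degrees(math.atan2(cross, dot)))

def length_ratio(f, g):
    """(c) Length: shorter length divided by longer length (1.0 means equal)."""
    lf = math.hypot(*flow_vector(f))
    lg = math.hypot(*flow_vector(g))
    longest = max(lf, lg)
    return 1.0 if longest == 0 else min(lf, lg) / longest

f = ((0, 0), (4, 0))   # 4 units long, pointing east
g = ((0, 1), (0, 3))   # 2 units long, pointing north

print(midpoint_distance(f, g))
print(direction_difference(f, g))
print(length_ratio(f, g))
```

Keeping the three components separate makes it possible to threshold or weight each one independently before combining them into a single similarity score for clustering.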
Article
Apiculture is an important industry closely related to the national economy and people’s livelihoods. Beekeepers’ behavior is an important factor affecting the yield, quality, and benefits of apiculture. However, there is a lack of a systematic understanding of the long-term changes in beekeeping decisions made by beekeepers. Using panel data, we analyzed the dynamic trends and related influencing factors of decisions made by beekeeping models, honey source plant selection, and the migration flow space of beekeepers from 2009 to 2020. The results showed that the proportion of the LMB model decreased, while the PAB and SMB models continued to increase, the frequency of utilization of the main nectar source plants for honey collection decreased, and the concentration of migratory flow of beekeeping increased. Behavior of beekeepers from 2009 to 2020 showed a certain degree of spatial contraction, which seriously restricted the effective use of nectar plant resources. Family attributes, economic status, beekeeping models, and disaster conditions directly or indirectly affected beekeepers’ decisions. We propose a series of recommendations to facilitate the transformation and advancement of the Chinese bee industry. This study promotes an understanding of sustainable development of the bee industry in China and other countries worldwide.
Article
The network-constrained flow is defined as the movement between two locations along road networks, such as the residence-workplace flow of city dwellers. Among flow patterns, clustering (i.e. the origins and destinations are aggregated simultaneously) is one of the cities’ most common and vital patterns, which assists in uncovering fundamental mobility trends. The existing methods for detecting the clustering pattern of network-constrained flows do not consider the impact of road network topology complexity on detection results. They may underestimate the flow clustering between networks of simple topology (roads with simpler shapes and fewer links, e.g. straight roads) but with high network intensity (i.e. flow number per network flow space), and determining the actual scale of an observed pattern remains challenging. This study develops a novel method, the length-squared L-function, to identify clustering patterns of network-constrained flows. We first use the L-function and its derivative to examine the clustering scales. Further, we calculate the local L-function to ascertain the locations of the clustering patterns. In synthetic and practical experiments, our method can identify flow clustering patterns of high intensities and reveal the residents’ main travel behavior trends with taxi OD flows, providing more reasonable suggestions for urban planning.
Article
It is challenging to map large spatial flow data due to the problem of occlusion and cluttered display, where hundreds of thousands of flows overlap and intersect each other. Existing flow mapping approaches often aggregate flows using predetermined high-level geographic units (e.g. states) or bundling partial flow lines that are close in space, both of which cause a significant loss or distortion of information and may miss major patterns. In this research, we developed a flow clustering method that extracts clusters of similar flows to avoid the cluttering problem, reveal abstracted flow patterns, and meanwhile preserves data resolution as much as possible. Specifically, our method extends the traditional hierarchical clustering method to aggregate and map large flow data. The new method considers both origins and destinations in determining the similarity of two flows, which ensures that a flow cluster represents flows from similar origins to similar destinations and thus minimizes information loss during aggregation. With the spatial index and search algorithm, the new method is scalable to large flow data sets. As a hierarchical method, it generalizes flows to different hierarchical levels and has the potential to support multi-resolution flow mapping. Different distance definitions can be incorporated to adapt to uneven spatial distribution of flows and detect flow clusters of different densities. To assess the quality and fidelity of flow clusters and flow maps, we carry out a case study to analyze a data set of 243,850 taxi trips within an urban area.
Article
The scan statistic is commonly used to test if a one-dimensional point process is purely random, or if any clusters can be detected. Here it is simultaneously extended in three directions: (i) a spatial scan statistic for the detection of clusters in a multi-dimensional point process is proposed, (ii) the area of the scanning window is allowed to vary, and (iii) the baseline process may be any inhomogeneous Poisson process or Bernoulli process with intensity proportional to some known function. The main interest is in detecting clusters not explained by the baseline process. These methods are illustrated on an epidemiological data set, but there are other potential areas of application as well.
Article
Introduced in this paper is a family of statistics, G, that can be used as a measure of spatial association in a number of circumstances. The basic statistic is derived, its properties are identified, and its advantages explained. Several of the G statistics make it possible to evaluate the spatial association of a variable within a specified distance of a single point. A comparison is made between a general G statistic and Moran’s I for similar hypothetical and empirical conditions. The empirical work includes studies of sudden infant death syndrome by county in North Carolina and dwelling unit prices in metropolitan San Diego by zip-code districts. Results indicate that G statistics should be used in conjunction with I in order to identify characteristics of patterns not revealed by the I statistic alone and, specifically, the Gi and Gi* statistics enable us to detect local “pockets” of dependence that may not show up when using global statistics.
Article
This paper provides a rigorous foundation for the second-order analysis of stationary point processes on general spaces. It illuminates the results of Bartlett on spatial point processes, and covers the point processes of stochastic geometry, including the line and hyperplane processes of Davidson and Krickeberg. The main tool is the decomposition of moment measures pioneered by Krickeberg and Vere-Jones. Finally some practical aspects of the analysis of point processes are discussed.
Article
Some complex geographic events are associated with multiple point locations. Such events include, but are not limited to, those describing linkages between and among places. The term multi-location event is used in the paper to refer to these geographical phenomena. Through formalization of the multi-location event problem, this paper situates the analysis of multi-location events within the broad context of point pattern analysis techniques. Two alternative approaches (vector autocorrelation analysis and cluster correspondence analysis) to the spatial dependence of paired-location events (i.e., two-location events) are explored, with a discussion of their appropriateness to general multi-location event problems. The research proposes a framework of cluster correspondence analysis for the detection of local non-stationarities in the spatial process generating multi-location events. A new algorithm for local analysis of cluster correspondence is proposed. It is implemented on a large-scale dataset of vehicle theft and recovery location pairs in Buffalo, New York.
Chapter
In this chapter, we discuss key concepts for exploratory spatial data analysis (ESDA). We start with its close relationship to exploratory data analysis (EDA) and introduce different types of spatial data. Then, we discuss how to explore spatial data via different types of maps and via linking and brushing. A key technique for ESDA is local indicators of spatial association (LISA). ESDA needs to be supported by software. We discuss two main lines of software developments: GIS-based solutions and stand-alone solutions.
Article
This article introduces measures to quantify spatial autocorrelation for vectors. In contrast to scalar variables, spatial autocorrelation for vectors involves an assessment of both direction and magnitude in space. Extending conventional approaches, measures of global and local spatial associations for vectors are proposed, and the associated statistical properties and significance testing are discussed. The new measures are applied to study the spatial association of taxi movements in the city of Shanghai. Complications due to the edge effect are also examined.
Conference Paper
Spatio-temporal and geo-referenced datasets are growing rapidly with the development of technologies such as GPS and satellite systems, and trajectory clustering has attracted considerable scholarly interest. Existing trajectory clustering algorithms group similar trajectories as a whole and cannot distinguish the direction of a trajectory. Our key observation is that clustering trajectories as a whole can miss common sub-trajectories, and that trajectories carry direction information. In many applications, discovering common sub-trajectories is very useful. In this paper, we present a trajectory clustering algorithm, CTHD (clustering of trajectory based on Hausdorff distance). In CTHD, a trajectory is first described by a sequence of flow vectors and partitioned into a set of sub-trajectories. Next, the similarity between trajectories is measured by their respective Hausdorff distances. Finally, the trajectories are clustered by the DBSCAN clustering algorithm. The proposed algorithm differs from other schemes using Hausdorff distance in that the flow vectors include both position and direction, so it can distinguish trajectories in different directions. The experimental results demonstrate this capability.
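For readers unfamiliar with the Hausdorff distance used as the similarity measure in this abstract, a minimal sketch on plain point sequences is given below. It implements only the classical symmetric Hausdorff distance between two point sets; the flow-vector representation with direction, the sub-trajectory partitioning, and the DBSCAN step of CTHD are not reproduced, and the function names are hypothetical.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from any point of a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

traj_a = [(0, 0), (1, 0), (2, 0)]
traj_b = [(0, 1), (1, 1), (2, 1), (5, 1)]

print(directed_hausdorff(traj_a, traj_b))  # every point of traj_a is 1 away from traj_b
print(hausdorff(traj_a, traj_b))           # dominated by the outlying point (5, 1)

# The classical measure treats a trajectory as an unordered point set,
# so a trajectory and its reverse are at distance zero.
print(hausdorff(traj_a, list(reversed(traj_a))))
```

The last line shows why position alone is insufficient for directed movement data: a trajectory and its reversal are indistinguishable under the plain measure, which is the limitation the abstract addresses by adding direction to the flow vectors.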