Book

Spatial Data Analysis: Theory and Practice

Authors: Robert Haining

Abstract

Preface
Readership
Acknowledgements
Introduction
Part I. The Context for Spatial Data Analysis: 1. Spatial data analysis: scientific and policy context; 2. The nature of spatial data
Part II. Spatial Data: Obtaining Data and Quality Issues: 3. Obtaining spatial data through sampling; 4. Data quality: implications for spatial data analysis
Part III. The Exploratory Analysis of Spatial Data: 5. Exploratory analysis of spatial data; 6. Exploratory spatial data analysis: visualisation methods; 7. Exploratory spatial data analysis: numerical methods
Part IV. Hypothesis Testing in the Presence of Spatial Autocorrelation: 8. Hypothesis testing in the presence of spatial dependence
Part V. Modeling Spatial Data: 9. Models for the statistical analysis of spatial data; 10. Statistical modeling of spatial variation: descriptive modeling; 11. Statistical modeling of spatial variation: explanatory modeling
Appendices
References
Index
Robert Haining, University of Cambridge
Cambridge University Press
Contents
Preface
Acknowledgements
Introduction
0.1 About the book
0.2 What is spatial data analysis?
0.3 Motivation for the book
0.4 Organization
0.5 The spatial data matrix
Part A The context for spatial data analysis
1 Spatial data analysis: scientific and policy context
1.1 Spatial data analysis in science
1.1.1 Generic issues of place, context and space in scientific explanation
(a) Location as place and context
(b) Location and spatial relationships
1.1.2 Spatial processes
1.2 Place and space in specific areas of scientific explanation
1.2.1 Defining spatial subdisciplines
1.2.2 Examples: selected research areas
(a) Environmental criminology
(b) Geographical and environmental (spatial) epidemiology
(c) Regional economics and the new economic geography
(d) Urban studies
(e) Environmental sciences
1.2.3 Spatial data analysis in problem solving
1.3 Spatial data analysis in the policy area
1.4 Some examples of problems that arise in analysing spatial data
1.4.1 Description and map interpretation
1.4.2 Information redundancy
1.4.3 Modelling
1.5 Concluding remarks
2 The nature of spatial data
2.1 The spatial data matrix: conceptualization and representation issues
2.1.1 Geographic space: objects, fields and geometric representations
2.1.2 Geographic space: spatial dependence in attribute values
2.1.3 Variables
(a) Classifying variables
(b) Levels of measurement
2.1.4 Sample or population?
2.2 The spatial data matrix: its form
2.3 The spatial data matrix: its quality
2.3.1 Model quality
(a) Attribute representation
(b) Spatial representation: general considerations
(c) Spatial representation: resolution and aggregation
2.3.2 Data quality
(a) Accuracy
(b) Resolution
(c) Consistency
(d) Completeness
2.4 Quantifying spatial dependence
(a) Fields: data from two-dimensional continuous space
(b) Objects: data from two-dimensional discrete space
2.5 Concluding remarks
Part B Spatial data: obtaining data and quality issues
3 Obtaining spatial data through sampling
3.1 Sources of spatial data
3.2 Spatial sampling
3.2.1 The purpose and conduct of spatial sampling
3.2.2 Design- and model-based approaches to spatial sampling
(a) Design-based approach to sampling
(b) Model-based approach to sampling
(c) Comparative comments
3.2.3 Sampling plans
3.2.4 Selected sampling problems
(a) Design-based estimation of the population mean
(b) Model-based estimation of means
(c) Spatial prediction
(d) Sampling to identify extreme values or detect rare events
3.3 Maps through simulation
4 Data quality: implications for spatial data analysis
4.1 Errors in data and spatial data analysis
4.1.1 Models for measurement error
(a) Independent error models
(b) Spatially correlated error models
4.1.2 Gross errors
(a) Distributional outliers
(b) Spatial outliers
(c) Testing for outliers in large data sets
4.1.3 Error propagation
4.2 Data resolution and spatial data analysis
4.2.1 Variable precision and tests of significance
4.2.2 The change of support problem
(a) Change of support in geostatistics
(b) Areal interpolation
4.2.3 Analysing relationships using aggregate data
(a) Ecological inference: parameter estimation
(b) Ecological inference in environmental epidemiology: identifying valid hypotheses
(c) The modifiable areal units problem (MAUP)
4.3 Data consistency and spatial data analysis
4.4 Data completeness and spatial data analysis
4.4.1 The missing-data problem
(a) Approaches to analysis when data are missing
(b) Approaches to analysis when spatial data are missing
4.4.2 Spatial interpolation, spatial prediction
4.4.3 Boundaries, weights matrices and data completeness
4.5 Concluding remarks
Part C The exploratory analysis of spatial data
5 Exploratory spatial data analysis: conceptual models
5.1 EDA and ESDA
5.2 Conceptual models of spatial variation
(a) The regional model
(b) Spatial 'rough' and 'smooth'
(c) Scales of spatial variation
6 Exploratory spatial data analysis: visualization methods
6.1 Data visualization and exploratory data analysis
6.1.1 Data visualization: approaches and tasks
6.1.2 Data visualization: developments through computers
6.1.3 Data visualization: selected techniques
6.2 Visualizing spatial data
6.2.1 Data preparation issues for aggregated data: variable values
6.2.2 Data preparation issues for aggregated data: the spatial framework
(a) Non-spatial approaches to region building
(b) Spatial approaches to region building
(c) Design criteria for region building
6.2.3 Special issues in the visualization of spatial data
6.3 Data visualization and exploratory spatial data analysis
6.3.1 Spatial data visualization: selected techniques for univariate data
(a) Methods for data associated with point or area objects
(b) Methods for data from a continuous surface
6.3.2 Spatial data visualization: selected techniques for bi- and multi-variate data
6.3.3 Uptake of breast cancer screening in Sheffield
6.4 Concluding remarks
7 Exploratory spatial data analysis: numerical methods
7.1 Smoothing methods
7.1.1 Resistant smoothing of graph plots
7.1.2 Resistant description of spatial dependencies
7.1.3 Map smoothing
(a) Simple mean and median smoothers
(b) Introducing distance weighting
(c) Smoothing rates
(d) Non-linear smoothing: headbanging
(e) Non-linear smoothing: median polishing
(f) Some comparative examples
7.2 The exploratory identification of global map properties: overall clustering
7.2.1 Clustering in area data
7.2.2 Clustering in a marked point pattern
7.3 The exploratory identification of local map properties
7.3.1 Cluster detection
(a) Area data
(b) Inhomogeneous point data
7.3.2 Focused tests
7.4 Map comparison
(a) Bivariate association
(b) Spatial association
Part D Hypothesis testing and spatial autocorrelation
8 Hypothesis testing in the presence of spatial dependence
8.1 Spatial autocorrelation and testing the mean of a spatial data set
8.2 Spatial autocorrelation and tests of bivariate association
8.2.1 Pearson's product moment correlation coefficient
8.2.2 Chi-square tests for contingency tables
Part E Modelling spatial data
9 Models for the statistical analysis of spatial data
9.1 Descriptive models
9.1.1 Models for large-scale spatial variation
9.1.2 Models for small-scale spatial variation
(a) Models for data from a surface
(b) Models for continuous-valued area data
(c) Models for discrete-valued area data
9.1.3 Models with several scales of spatial variation
9.1.4 Hierarchical Bayesian models
9.2 Explanatory models
9.2.1 Models for continuous-valued response variables: normal regression models
9.2.2 Models for discrete-valued area data: generalized linear models
9.2.3 Hierarchical models
(a) Adding covariates to hierarchical Bayesian models
(b) Modelling spatial context: multi-level models
10 Statistical modelling of spatial variation: descriptive modelling
10.1 Models for representing spatial variation
10.1.1 Models for continuous-valued variables
(a) Trend surface models with independent errors
(b) Semi-variogram and covariance models
(c) Trend surface models with spatially correlated errors
10.1.2 Models for discrete-valued variables
10.2 Some general problems in modelling spatial variation
10.3 Hierarchical Bayesian models
11 Statistical modelling of spatial variation: explanatory modelling
11.1 Methodologies for spatial data modelling
11.1.1 The 'classical' approach
11.1.2 The econometric approach
(a) A general spatial specification
(b) Two models of spatial pricing
11.1.3 A 'data-driven' methodology
11.2 Some applications of linear modelling of spatial data
11.2.1 Testing for regional income convergence
11.2.2 Models for binary responses
(a) A logistic model with spatial lags on the covariates
(b) Autologistic models with covariates
11.2.3 Multi-level modelling
11.2.4 Bayesian modelling of burglaries in Sheffield
11.2.5 Bayesian modelling of children excluded from school
11.3 Concluding comments
Appendix I Software
Appendix II Cambridgeshire lung cancer data
Appendix III Sheffield burglary data
Appendix IV Children excluded from school: Sheffield
References
Index
... Statistical analysis Anselin and Griffith (1988) warned some time ago that spatial heterogeneity could be present together with spatial clustering (or dependence). Likewise, many have warned that spatial heterogeneity may be due to model misspecification (e.g., omitted variable bias, measurement error bias, measurement-induced heterogeneity, etc.) or to varying functional forms and model parameters evidencing structural instability (Getis, 1996; Haining, 2003). The first issue results in heteroskedasticity, while the second may be the result of true local contextual effects, i.e., residents in neighborhoods behave differently independently of their social composition or environmental characteristics. ...
Article
This study shows that the global and local spatial patterns of burglary rates in the District of Columbia neighborhoods varied significantly between 2019 and 2021, and that their relationship with Concentrated Disadvantage (CD) was spatially heterogeneous. Hotspot clusters of neighborhoods with high levels of burglary changed rapidly from one year to another, while clusters with positive and negative local associations between burglary and CD did not change significantly over time. The main lessons are that burglary hotspots are harder to predict than bivariate burglary and CD hotspots, and that the relationship between burglary and CD varies significantly across neighborhoods. The research and policy implication is that we need to move beyond the spatial univariate analysis of hotspots of crime, to more detailed spatial bivariate analyses of correlates of crime.
... Spatial data analysis can help in the quest for scientific explanations. Since observations in geographic space are frequently associated, it also plays a role in broader problem solving [14]. The study assists decision-makers in determining priorities when establishing options for allocating public health resources. ...
Preprint
In this study, a spatial latent trait model was developed to address the challenge of parameter estimation for ordinal response variables. The development of the model involved employing the Bayesian rank likelihood estimation method. The simulation algorithm was provided in detail, and the performance and sensitivity of the developed method were evaluated using simulation techniques. Method evaluation was conducted to identify any convergence issues in the developed method. The results showed that trace plots of all parameters (β, υ, and γ) showed good mixing and quick convergence. The potential scale reduction factor value for all parameters did not exceed one, indicating that convergence issues were not identified. Additionally, the developed method performed well, as demonstrated by the posterior predictive check, since simulated data generated from the posterior predictive distribution closely resemble the observed data. The developed method also effectively captures within-region variations and spatial correlations between the regions through the latent traits parameters. The assessment of performance included metrics such as root mean square error, mean absolute error, and the probability coverage of the corresponding 95% confidence intervals of the estimates. The results indicate that the estimates obtained from the developed method outperform the existing classical estimates. As a result, it can be concluded that the spatial latent trait model using Bayesian rank likelihood estimation is regarded as the better model.
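The convergence check described in this abstract relies on the potential scale reduction factor. As a rough illustration only, the sketch below computes the basic Gelman-Rubin form of that factor for a single parameter. It is not the authors' code: the function name `potential_scale_reduction` and the toy chains are hypothetical, and the input is assumed to be several equal-length chains of posterior draws. Values close to one are usually read as no evidence of non-convergence.

```python
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin factor for one parameter.

    `chains` is a list of equal-length lists of posterior draws
    (a hypothetical input format chosen for this sketch).
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length >= 2")

    chain_means = [mean(c) for c in chains]
    # W: average within-chain sample variance.
    w = mean(variance(c) for c in chains)
    if w == 0:
        raise ValueError("within-chain variance is zero")
    # B: between-chain variance of the chain means, scaled by chain length.
    b = n * variance(chain_means)
    # Pooled estimate of the posterior variance.
    var_hat = (n - 1) / n * w + b / n
    return (var_hat / w) ** 0.5


# Toy chains (made-up numbers): the first pair agrees, the second does not.
agreeing = potential_scale_reduction([[1, 2, 3, 4], [1, 2, 3, 4]])
disagreeing = potential_scale_reduction([[0, 1, 0, 1], [10, 11, 10, 11]])
```

With chains that sit far apart, the between-chain term dominates and the factor rises well above one, which is the situation the diagnostic is meant to flag.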
... For this analysis, the selected spatial weights matrix, to determine which of the surrounding pixels are deemed to be spatially autocorrelated, was based on Delaunay triangulation (or Voronoi diagram) [46]. The spatial weights matrix was row standardized to mitigate the bias that occurs when the number of neighbours is dependent upon the aggregation strategy that is employed [47][48][49][50]. The hotspot analysis was also conducted with the application of the false discovery rate correction, which reduces the critical p-values to account for the effects of spatial dependence and multiple tests [51]. ...
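Row standardization, as mentioned in this excerpt, divides each row of the weights matrix by its row sum so that every observation's neighbour weights add up to one regardless of how many neighbours it has. A minimal sketch, assuming a small binary contiguity matrix stored as nested lists; the matrix values and the name `row_standardize` are illustrative and not taken from the cited study.

```python
def row_standardize(weights):
    """Divide each row of a square weights matrix by its row sum."""
    standardized = []
    for row in weights:
        total = sum(row)
        # An observation with no neighbours keeps an all-zero row.
        standardized.append([w / total if total else 0.0 for w in row])
    return standardized


# Hypothetical binary contiguity matrix for four areas
# (1 = the two areas are neighbours, 0 = they are not).
binary_w = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]

standardized_w = row_standardize(binary_w)
row_sums = [sum(row) for row in standardized_w]
```

After standardization the second area, which has three neighbours, gives each of them a weight of one third, while the fourth area gives its single neighbour a weight of one.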
Article
Our study encompasses the Oil Sands Area (OSA) within northern Alberta, Canada, which has experienced substantial environmental changes over the last decades, in association with natural and anthropogenic disturbances. Using composites of Landsat imagery for 5-year intervals between 2000 and 2020, we performed two parallel geospatial analyses to assess environmental changes, examining landscape metrics and spectral indices. Landscape metrics were calculated from land use/land cover maps derived from a Random Forest supervised classification. Spectral indices included Normalized Difference Vegetation Index (NDVI) and Normalized Difference Built-up Index (NDBI), among others. Both hierarchical zonal analysis of spectral indices and zonal landscape metrics were calculated based upon two different aggregations of nested drainage basin features from hydrologic unit code (HUC - Watersheds of Alberta). Spatial contiguity of changes was evaluated by hotspot analysis. HUCs determined to experience significant changes at coarse aggregation level were examined at finer level. The combination of landscape metrics and zonal analysis provided evidence of substantial, yet localized, areas of changing trends. Mixed forest experienced the most significant changes; urban/barren areas initially increased and later decreased, indicating change both in agricultural and human-made areas.
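Both spectral indices named in this abstract are normalized differences of two reflectance bands: NDVI contrasts near-infrared with red, and NDBI contrasts shortwave-infrared with near-infrared. The sketch below shows the per-pixel arithmetic only. The reflectance values are made up, the function names are hypothetical, and which Landsat band numbers supply each input depends on the sensor, so none of this reproduces the study's processing chain.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or NaN when both bands are zero."""
    total = a + b
    if total == 0:
        return float("nan")
    return (a - b) / total


def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    return normalized_difference(nir, red)


def ndbi(swir, nir):
    """Normalized Difference Built-up Index for one pixel."""
    return normalized_difference(swir, nir)


# Hypothetical surface reflectances for a vegetated pixel.
vegetated_ndvi = ndvi(nir=0.5, red=0.1)
vegetated_ndbi = ndbi(swir=0.3, nir=0.5)
```

Dense vegetation reflects strongly in the near-infrared, so the vegetated pixel gets a clearly positive NDVI and a negative NDBI.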
... [27] In a spatial regression model, the relationship between the dependent variable and the independent variables is adjusted taking spatial dependence into account. [27][28][29] Also, we can consider longitudinal models to include the spatial dependence. [30] Regression models: Regression is a statistical technique that seeks to identify relationships between variables, allowing the prediction or understanding of how a dependent variable is influenced by one or more independent variables. ...
Article
In China, the country of COVID-19 origin, more than 77,000 cases of COVID-19 infection had been reported by February 23rd, 2020, and 60% of confirmed cases were reported in the city of Wuhan. Mozambique declared a state of emergency in March 2020, and different prevention measures were implemented to control and respond in a timely manner to the pandemic, including the early diagnosis of cases of the disease. The present work reports some details about a larger project with the main objective of computing models of analysis and visualization of COVID-19 data in Mozambique. The topic falls within the area of Statistics, with the purpose of providing evidence that explains the stage of the country regarding the evolution of COVID-19 cases (from the notification of the first case of COVID-19 in Mozambique on March 22nd, 2020, until May 31st, 2022), with the focus on the provinces of Maputo, Nampula, Cabo Delgado and Niassa. The work considered qualitative and quantitative data to allow decision-making in the health area on measures to prevent the pandemic and the trend of cases and deaths from the disease.
... Although there is no consensual explanation for the scientific definition of space in theoretical circles (Sack, 1974; Montello and Sutton, 2006; Wither, 2009; Liu et al., 2017; Olechnicka et al., 2019; Komilova et al., 2021), the recognition of the existence of "space" and "spatial attributes" is surprisingly consistent, and gives "space" various natural, social and economic attributes (Haining, 2003). Bourdieu (1998) believed that space itself does not have practical significance, and it is important only when human economic and social activities give it a corresponding position. ...
Article
The impact of spatial heterogeneity on industrial outputs is a new important topic in economic geography. A considerable amount of research literature has accumulated, but the academic community lacks a systematic and comprehensive review and consensus on this topic. This study carried out research by mining the relevant classical literature. This investigation first combed the connotation of spatial heterogeneity, which is both corresponding to and related to spatial dependence. Theorists generally acknowledge that there is spatial heterogeneity in the process of industrial outputs. Then this study summarizes the logical basis, relationship coordination, measurement and other aspects of the effect of spatial heterogeneity on industrial outputs. In analyzing the impact of spatial heterogeneity on industrial outputs, we should not ignore the spatial dimension, but must also pay attention to the heterogeneity of individual enterprises. Industrial output analysis needs to be based on the relationship between spatial heterogeneity and spatial dependence. The influence of spatial heterogeneity on industrial outputs and the degree of differences among observation objects can be measured by econometric methods. The common indicators for measuring and quantitatively describing the impact of spatial heterogeneity on industrial outputs mainly include semivariogram, the spatial expansion model and the geographical weighted regression model. Finally, some directions of future research are pointed out in order to provide useful ideas for future theoretical research and industrial practice.
Article
Amazonia contains the most extensive tropical forests on Earth, but Amazon carbon sinks of atmospheric CO2 are declining, as deforestation and climate-change-associated droughts [1–4] threaten to push these forests past a tipping point towards collapse [5–8]. Forests exhibit complex drought responses, indicating both resilience (photosynthetic greening) and vulnerability (browning and tree mortality), that are difficult to explain by climate variation alone [9–17]. Here we combine remotely sensed photosynthetic indices with ground-measured tree demography to identify mechanisms underlying drought resilience/vulnerability in different intact forest ecotopes [18,19] (defined by water-table depth, soil fertility and texture, and vegetation characteristics). In higher-fertility southern Amazonia, drought response was structured by water-table depth, with resilient greening in shallow-water-table forests (where greater water availability heightened response to excess sunlight), contrasting with vulnerability (browning and excess tree mortality) over deeper water tables. Notably, the resilience of shallow-water-table forest weakened as drought lengthened. By contrast, lower-fertility northern Amazonia, with slower-growing but hardier trees (or, alternatively, tall forests, with deep-rooted water access), supported more-drought-resilient forests independent of water-table depth. This functional biogeography of drought response provides a framework for conservation decisions and improved predictions of heterogeneous forest responses to future climate changes, warning that Amazonia’s most productive forests are also at greatest risk, and that longer/more frequent droughts are undermining multiple ecohydrological strategies and capacities for Amazon forest resilience.
Article
Background: Epidemics of the dengue virus can trigger widespread morbidity and mortality, and there is no specific treatment. Examining the spatial autocorrelation and variability of dengue prevalence throughout Bangladesh's 64 districts was the focus of this study.
Methods: The spatial autocorrelation is evaluated with the help of Moran's I and Geary's C. Local Moran's I was used to detect hotspots and cold spots, whereas local Getis-Ord G was used to identify only spatial hotspots. The spatial heterogeneity has been detected using various conventional and spatial models, including the Poisson-Gamma model, the Poisson-Lognormal model, the Conditional Autoregressive (CAR) model, the Convolution model, and the BYM2 model. These models are implemented using Gibbs sampling and other Bayesian hierarchical approaches to analyze the posterior distribution effectively, enabling inference within a Bayesian context.
Results: The study's findings show that Moran's I and Geary's C analysis provides a substantial clustering pattern of positive spatial autocorrelation of dengue fever (DF) rates between surrounding districts at a 90% confidence interval. The Local Indicators of Spatial Autocorrelation cluster mapped spatial clusters and outliers based on prevalence rates, while the local Getis-Ord G displayed a thorough breakdown of high or low rates, omitting outliers. Although Chattogram had the most dengue cases (15,752), Khulna district had a higher prevalence rate (133.636) than Chattogram (104.796). The BYM2 model, determined to be well-fitted based on the lowest Deviance Information Criterion value (527.340), explains a significant association between spatial heterogeneity and prevalence rates.
Conclusion: This research pinpoints the district with the highest prevalence rate for dengue and the neighboring districts that also have high risk, allowing government agencies and communities to take the necessary precautions to mollify the risk effect of DF.
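The two global statistics used in this abstract can be written out directly. For values x_i with mean x̄, weights w_ij and S0 the sum of all weights, Moran's I is (n / S0) · Σ_ij w_ij (x_i − x̄)(x_j − x̄) / Σ_i (x_i − x̄)², and Geary's C is ((n − 1) / (2 S0)) · Σ_ij w_ij (x_i − x_j)² / Σ_i (x_i − x̄)². The sketch below is a plain illustration of those formulas on a toy example: four areas in a row with binary adjacency weights. The data and function names are hypothetical and have nothing to do with the dengue data, and no significance testing is included.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean_value = sum(values) / n
    dev = [v - mean_value for v in values]
    s0 = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between every pair of areas.
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)


def gearys_c(values, weights):
    """Global Geary's C for a list of values and a square weights matrix."""
    n = len(values)
    mean_value = sum(values) / n
    s0 = sum(sum(row) for row in weights)
    # Weighted squared differences between every pair of areas.
    sq_diff = sum(weights[i][j] * (values[i] - values[j]) ** 2
                  for i in range(n) for j in range(n))
    return ((n - 1) / (2 * s0)) * sq_diff / sum((v - mean_value) ** 2 for v in values)


# Toy example: four areas in a row, each adjacent to the next one.
example_values = [1, 2, 3, 4]
chain_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

example_i = morans_i(example_values, chain_weights)
example_c = gearys_c(example_values, chain_weights)
```

Because neighbouring areas in the toy example hold similar values, Moran's I comes out positive and Geary's C falls below one, both of which point to positive spatial autocorrelation.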
Article
In this study, statistics on domestic tourists arriving at municipality-licensed and ministry-licensed facilities were examined using spatial statistical methods. Taking the years 2018, 2020 and 2022 as the basis, the effect of the pandemic period on domestic tourism was assessed. To determine the spatial distribution of the data, spatial autocorrelation analysis was carried out using global and local Moran's I methods. The analysis, conducted at district scale, showed that the pandemic did not cause a marked change in preferred tourism destinations as far as domestic tourists are concerned. The claims of radical change that were frequently voiced during and after the epidemic did not materialize statistically, at least within these periods. However, change requires a long period of time. Crises affect tourists' preferences as well as the sector. Tracking tourist attitudes and behaviour is important for reducing risks and crises, for planning, for ensuring sustainable tourism development, and for monitoring economic, social and environmental outcomes. For this reason, the study evaluated the tendency towards geographical clustering. The results show that domestic tourism in Türkiye exhibits an increasingly noticeable tendency towards geographical clustering.
Article
A test statistic for the detection of spatial clusters is developed by generalizing the common chi‐square goodness‐of‐fit test. The paper includes a discussion of the relationship between the statistic and other associated statistics, and provides an analysis of both its null distribution and power. The paper concludes with the development of a local version of the statistic and an application to leukemia clustering in central New York.
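For orientation, the common chi-square goodness-of-fit statistic that this paper takes as its starting point is the sum over regions of (observed − expected)² / expected. The sketch below computes only that non-spatial baseline. The spatial generalization developed in the paper is not reproduced here, and the counts and the name `chi_square_gof` are hypothetical.

```python
def chi_square_gof(observed, expected):
    """Common chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical case counts in four regions, with expected counts
# proportional to equal population shares (40 cases in total).
observed_cases = [12, 5, 8, 15]
expected_cases = [10.0, 10.0, 10.0, 10.0]

gof_statistic = chi_square_gof(observed_cases, expected_cases)
```

The baseline statistic treats every region independently, so it gives the same value however the regions are arranged on the map; accounting for that arrangement is what a spatial version adds.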
Article
Studies of spatial market integration draw their implications from a theory which assumes that there are no intraregional transport costs. An alternative theory is offered, based on the assumptions that buyers and sellers are spatially dispersed and intraregional transport costs are significant. This implies that the market is a linked oligopoly (or oligopsony) and that market integration tests are tests of alternative oligopoly price formation processes. For example, collusive basing-point pricing produces results typically assumed to imply efficiently integrated markets, while competitive FOB pricing does not. The theoretical implications are illustrated with an analysis of hog prices in Canada.
Article
It is often of interest to find the maximum or near maxima among a set of vector-valued parameters in a statistical model; in the case of disease mapping, for example, these correspond to relative-risk “hotspots” where public-health intervention may be needed. The general problem is one of estimating nonlinear functions of the ensemble of relative risks, but biased estimates result if posterior means are simply substituted into these nonlinear functions. The authors obtain better estimates of extrema from a new, weighted ranks squared error loss function. The derivation of these Bayes estimators assumes a hidden-Markov random-field model for relative risks, and their behaviour is illustrated with real and simulated data.