Group challenge 1: Example
Spatial Statistics - BIOSTAT 696/896
Michele Peruzzi
University of Michigan
Introduction
- We consider a scientific problem in the area of whateverology
- Data on this topic are available at website https://whateverology.idk
- Specifically we target the study of ABC, using data https://whateverology.idk/ABC.csv
- The dataset is a point-referenced dataset of 5000 observations
Data overview
- The dataset is of dimension 5000, 5; in addition to latitude and longitude, it includes variables x, y, and z
- Histograms of all variables show that they are all roughly centered at zero
Data overview
- Variable Y is missing at 20.3 percent of total locations
- Variables X and Y are linearly negatively correlated
- Variable Z seems unrelated to the other two
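The summaries above (share of missing Y, pairwise correlations) can be computed along these lines. This is a minimal sketch rather than the analysis behind the slides: the data are synthetic stand-ins generated in the block, and the function name `overview` is hypothetical.

```python
import numpy as np

def overview(x, y, z):
    """Fraction of missing y, and pairwise correlations on complete cases."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    missing = np.isnan(y)
    obs = ~missing  # correlations use only locations where y is observed
    corr = np.corrcoef(np.vstack([x[obs], y[obs], z[obs]]))
    return {
        "frac_missing_y": missing.mean(),
        "corr_xy": corr[0, 1],
        "corr_xz": corr[0, 2],
        "corr_yz": corr[1, 2],
    }

# Synthetic stand-in data (assumption: NOT the ABC dataset).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
z = rng.normal(size=n)
y = -x + rng.normal(scale=0.5, size=n)   # y negatively related to x
y[rng.random(n) < 0.2] = np.nan          # about 20% of y set to missing

summary = overview(x, y, z)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With real data, the three arrays would be the corresponding columns of the CSV file, with missing Y encoded as NaN.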
Data type and locations
- Data locations cover the spatial domain roughly uniformly
- We consider the variables X and Y over the domain D
More visualization
- We visualize interesting features of the data!
[Here]
Maps of the data
Explaining Y via X
- By inspecting the maps of X and Y we see that there may be a negative association between them
- This makes sense because X explains Y in a certain way according to scientific knowledge
- However, we note that Y is missing in some large areas where X is available
- Further investigation is needed to determine whether missingness may impact results
- Value of Z may relate to whether Y is missing. We explain this with…
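One simple way to check whether missingness of Y relates to Z is to compare Z between locations where Y is missing and locations where it is observed. The sketch below uses a Welch-type t statistic. The data are synthetic, the logistic missingness mechanism is an assumption made only for illustration, and `missingness_vs_covariate` is a hypothetical helper name.

```python
import numpy as np

def missingness_vs_covariate(z, y):
    """Compare z where y is missing vs. observed.

    Returns the difference in means (missing minus observed)
    and a Welch-type t statistic for that difference.
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    miss = np.isnan(y)
    z_miss, z_obs = z[miss], z[~miss]
    diff = z_miss.mean() - z_obs.mean()
    se = np.sqrt(z_miss.var(ddof=1) / z_miss.size
                 + z_obs.var(ddof=1) / z_obs.size)
    return diff, diff / se

# Synthetic stand-in data (assumption): y is more likely missing when z is large.
rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)
y = rng.normal(size=n)
p_miss = 1.0 / (1.0 + np.exp(-(-1.5 + z)))   # logistic missingness in z
y = np.where(rng.random(n) < p_miss, np.nan, y)

diff, t_stat = missingness_vs_covariate(z, y)
print(f"mean difference in z: {diff:.3f}, t statistic: {t_stat:.2f}")
```

A large absolute t statistic suggests that Y is not missing completely at random with respect to Z, which is the situation the bullets above are concerned with.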
Brainstorming models for this dataset
- We think $X \to Y$ and $Z \to Y$
- We consider X and Z as covariates
- The measurements of Y are reasonably noisy
- Y has spatial variability that may not be fully explained by X and Z
- Therefore, the DAG could look like $\theta \to Y$, $\tau \to Y$, where $\theta$ is a vector of unknown parameters describing spatial dependence and $\tau$ is an unknown scalar describing how precise our measurements of Y are in this dataset
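One model consistent with this brainstorm can be written as follows. This is a sketch under assumptions not stated in the slides: the spatial effect $w$ is taken to be a zero-mean Gaussian process whose covariance function $C_\theta$ depends on $\theta$, the noise is Gaussian, and $\tau$ is read as a precision (inverse variance). The coefficients $\beta_0, \beta_X, \beta_Z$ are notation introduced here.

$$
Y(s) = \beta_0 + \beta_X X(s) + \beta_Z Z(s) + w(s) + \varepsilon(s), \qquad s \in D,
$$

$$
w(\cdot) \sim GP\big(0, C_\theta(\cdot, \cdot)\big), \qquad \varepsilon(s) \overset{iid}{\sim} N(0, 1/\tau).
$$

Here $w(s)$ carries the spatial variability of Y not explained by X and Z, and $\varepsilon(s)$ is the measurement noise.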
Preliminary analysis
- We consider a linear regression of Y on X and Z
- Coefficients on X and Z are -0.943137 and -0.0409616, respectively
- The p-value for the coefficient on X is very small, and the coefficient on Z is statistically significantly different from zero at the 5% significance level
- However, we are unsatisfied with this analysis because the residuals show significant spatial variability
- Because X and Z alone do not satisfactorily explain all variability in Y, we need to conduct more analyses
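The preliminary analysis above can be sketched as an ordinary least squares fit followed by a check for leftover spatial structure in the residuals. The example below uses Moran's I with binary k-nearest-neighbor weights as one possible diagnostic; the slides do not say which diagnostic was used. The data are synthetic, the smooth surface `w` is an assumed stand-in for an unobserved spatial effect, and `ols` and `morans_i` are hypothetical helper names.

```python
import numpy as np

def ols(y, covariates):
    """Least squares fit of y on an intercept plus the given covariates."""
    X = np.column_stack([np.ones(len(y))] + list(covariates))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def morans_i(values, coords, k=8):
    """Moran's I using binary k-nearest-neighbor spatial weights."""
    v = values - values.mean()
    n = len(v)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    nbrs = np.argsort(d, axis=1)[:, :k]    # indices of the k nearest points
    W = np.zeros((n, n))
    W[np.arange(n)[:, None], nbrs] = 1.0
    return (n / W.sum()) * (v @ W @ v) / (v @ v)

# Synthetic stand-in data (assumption: NOT the ABC dataset).
rng = np.random.default_rng(2)
n = 400
coords = rng.random((n, 2))                # locations in the unit square
x = rng.normal(size=n)
z = rng.normal(size=n)
w = np.sin(2 * np.pi * coords[:, 0]) * np.cos(2 * np.pi * coords[:, 1])
y = -1.0 * x + w + rng.normal(scale=0.3, size=n)

beta, resid = ols(y, [x, z])
moran_resid = morans_i(resid, coords)
print("coefficients (intercept, x, z):", np.round(beta, 3))
print("Moran's I of residuals:", round(float(moran_resid), 3))
```

Values of Moran's I well above zero indicate that nearby residuals are similar, which is the kind of residual spatial variability that motivates a spatial model. With real data, locations where Y is missing would have to be dropped before fitting.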
Concluding words
- We have identified a potential route for modeling this dataset by explaining variability of Y via X and Z
- Preliminary analyses seem to confirm this relationship
- A non-spatial linear regression is unable to capture all spatial variability
- We target the development of a more complex model incorporating spatial variability
- The fact that spatial variability remains important makes sense: Y can be explained by variable W, which is not in this dataset
- The unobserved variable W has an intuitive spatial pattern that may resemble the spatial pattern of regression residuals
- Our proposed intuitive model ignores the fact that missingness of Y may be explained by Z but we believe our assumption is reasonable because…