Group challenge 1
Spatial Statistics - BIOSTAT 696/896
Michele Peruzzi
University of Michigan
Groups?
- Have some groups been formed?
- Troubles?
- I have enabled Piazza if you need to find group partners
- For example, you could post your background (MS/PhD, what year…) and look for group members with complementary backgrounds
Getting started
- This group challenge is about getting things started with point-referenced and areal data
- The goal of this group activity is to make the first steps towards your final group project
- Some of the points I include can only be completed if you have your own data
- For these reasons, although not required, I very highly recommend that you begin searching for data for your final project
- Because it is still early, you should search for a point-referenced dataset AND an areal dataset
Point-referenced data requirements
- Spatial coordinates (latitude and longitude, or equivalent depending on the domain)
- Not necessarily environmental/ecology/satellite
- Must have a continuous variable that varies in space
- If you do any preprocessing (e.g., data transformations), you must document everything
- Check in with me before submitting your answers
Areal data requirements
- Higher resolution than state-level (US) or country-level (World). E.g., county-level works
- Areal data are much easier to find, so I will be pickier about your research question (it must be “interesting”, subjectively)
- Check in with me before submitting your answers
Checklist for both datasets, part 1
- Source of the data. Link, citation.
- Overall data description (number of observations, number of variables)
- Are there missing data? Is there a pattern or an intuitive reason for the missingness?
- What is in the dataset? Describe each column. Summary statistics. Correlation analysis (non-spatial is ok). Preliminary data analysis using methods that you know
- Make summary figures based on non-spatial descriptive statistics
- Describe the spatial component of the data: what is the spatial domain, its dimension, how are observations indexed in space? Do you think modeling covariance as decaying with distance is appropriate?
- For the point-referenced dataset: visualize the empirical variogram for the variable of interest. Does the variogram suggest spatial dependence in the data?
- Map the data. Map the spatial variables, and write short but meaningful captions for your figures.
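As a starting point for the variogram item above, here is a minimal sketch of a binned empirical semivariogram (the classical method-of-moments estimator) using only NumPy. The function name `empirical_variogram`, its arguments, and the simulated data are illustrative assumptions, not part of the course materials. The sketch also assumes the coordinates are already in a projected system where Euclidean distance is meaningful; raw latitude/longitude would need to be projected first or paired with a great-circle distance.

```python
import numpy as np

def empirical_variogram(coords, values, n_bins=15, max_dist=None):
    """Binned method-of-moments estimate of the semivariogram.

    coords: (n, 2) array of projected coordinates (assumed layout)
    values: (n,) array with the variable of interest
    Returns bin centers, semivariance per bin, and pair counts per bin.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)

    # All unique pairs (i < j), their distances and squared differences
    i, j = np.triu_indices(n, k=1)
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sqdiff = (values[i] - values[j]) ** 2

    if max_dist is None:
        # Rule of thumb (an assumption here): stop at half the largest distance
        max_dist = dist.max() / 2

    edges = np.linspace(0.0, max_dist, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    which = np.digitize(dist, edges) - 1  # bin index; pairs beyond max_dist get n_bins

    gamma = np.full(n_bins, np.nan)
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        in_bin = which == b
        counts[b] = in_bin.sum()
        if counts[b] > 0:
            gamma[b] = 0.5 * sqdiff[in_bin].mean()
    return centers, gamma, counts

# Simulated example: a smooth surface plus independent noise
rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(300, 2))
values = (np.sin(4 * coords[:, 0]) + np.cos(4 * coords[:, 1])
          + 0.1 * rng.normal(size=300))

centers, gamma, counts = empirical_variogram(coords, values)
for h, g, c in zip(centers, gamma, counts):
    print(f"distance {h:.3f}  semivariance {g:.3f}  pairs {c}")
```

Plotting `gamma` against `centers` gives the usual variogram figure. Semivariance that increases with distance before leveling off is consistent with spatial dependence, while a flat curve suggests little of it. Bins with few pairs are noisy, so it is worth reporting `counts` alongside the plot.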
Checklist for both datasets, part 2
- How do you imagine spatial dependence may play a role between the variables in the dataset?
- What are some research questions that these data could help answer (at least one for each dataset)?
- Write an intuitive directed acyclic graphical model outlining the important variables, the parameters, and how they could be related to each other (no need for other assumptions – keep it simple)
- What are the potential results that you may anticipate?
- What are some potential pitfalls or shortcomings of your model?
- What additional data could be useful for the purpose of your analysis?
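For the directed acyclic graph item above, one lightweight way to draft the structure before drawing it is to list each node's parents and let Python confirm that the graph has no cycles. The node names below (`x`, `beta`, `theta`, `w`, `tau2`, `y`) are hypothetical placeholders for a simple spatial regression with a latent spatial effect; replace them with the variables and parameters in your own data.

```python
from graphlib import TopologicalSorter, CycleError  # standard library, Python 3.9+

# Hypothetical nodes; each key maps to the list of its parents in the DAG.
parents = {
    "x": [],                           # observed covariate
    "beta": [],                        # regression coefficient
    "theta": [],                       # spatial covariance parameters
    "tau2": [],                        # measurement error variance
    "w": ["theta"],                    # latent spatial effect
    "y": ["x", "beta", "w", "tau2"],   # observed outcome
}

try:
    # static_order() lists parents before children and raises on a cycle
    order = list(TopologicalSorter(parents).static_order())
except CycleError as err:
    raise SystemExit(f"Not a DAG, cycle found: {err.args[1]}")

print("Topological order:", " -> ".join(order))
for child, pars in parents.items():
    for p in pars:
        print(f"{p} -> {child}")
```

The printed edge list can be copied directly into a drawing tool. If the check reports a cycle, at least one arrow needs to be removed or reversed before the model can be read as a DAG.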
Tips
- Keep it simple!
- You can start thinking about your final project format now: slides or a poster
- You can submit your answers as a slide set or in poster format
- If you choose poster: OK to leave lots of empty space now, or to cut later
- Submit your code too!
- Preferred format: a .qmd file that compiles into a .pdf document. Submit both as a .zip archive
Can’t find data?
- You can use the purpleair.csv dataset for point-referenced data
- You can use the heart_disease.csv dataset for areal data