Group challenge 1

Spatial Statistics - BIOSTAT 696/896

Michele Peruzzi

University of Michigan

Groups?

Have some groups been formed?
Troubles?
I have enabled Piazza if you need to find group partners
You could e.g. post your background (MS/PhD/what year…) and look for complementary group members

This group challenge is about getting things started with point-referenced data
The goal of this group activity is to make the first steps towards your final group project
Some of the points I include can only be completed if you have your own data
For these reasons, although it is not required, I strongly recommend that you begin searching for data for your final project

Spatial coordinates (latitude and longitude, or equivalent depending on the domain)
Not necessarily environmental/ecology/satellite
Must have a continuous variable that varies in space
If you do any preprocessing (e.g., data transformations), you must document everything
Check in with me before submitting your answers

Source of the data. Link, citation.
Overall data description (number of observations, number of variables)
Are there missing data? Is there a pattern or an intuitive reason for the missingness?
What is in the dataset? Describe each column. Summary statistics. Correlation analysis (non-spatial is ok). Preliminary data analysis using methods that you know
Make summary figures based on non-spatial descriptive statistics
Describe the spatial component of the data: what is the spatial domain, its dimension, how are observations indexed in space? Do you think modeling covariance as decaying with distance is appropriate?
Visualize the empirical variogram for the variable of interest. Does the variogram suggest spatial dependence in the data?
Map the data. Map the spatial variables, and write short but meaningful captions for your figures.
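As a rough illustration of the empirical variogram step above, here is a minimal sketch of the binned (method-of-moments) semivariogram estimator in Python with NumPy. The function name `empirical_variogram`, its parameters, and the simulated toy data are all assumptions made for this example; they are not part of the course materials. Distances are plain Euclidean, so latitude/longitude coordinates would first need to be projected (or a great-circle distance used instead).

```python
import numpy as np

def empirical_variogram(coords, values, n_bins=10, max_dist=None):
    """Binned method-of-moments estimate of the semivariogram.

    coords: (n, 2) array of locations; values: (n,) array of observations.
    Returns bin centers, semivariance per bin, and number of pairs per bin.
    (Hypothetical helper written for illustration only.)
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Visit each unordered pair of locations once (upper triangle, no diagonal).
    i, j = np.triu_indices(len(values), k=1)
    dists = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diffs = (values[i] - values[j]) ** 2
    if max_dist is None:
        # Assumption: cap lags at half the largest distance (a common convention).
        max_dist = dists.max() / 2
    edges = np.linspace(0.0, max_dist, n_bins + 1)
    # Bin index of each pair; pairs at or beyond max_dist are dropped.
    which = np.digitize(dists, edges) - 1
    keep = which < n_bins
    counts = np.bincount(which[keep], minlength=n_bins)
    sums = np.bincount(which[keep], weights=sq_diffs[keep], minlength=n_bins)
    # Semivariance = half the mean squared difference; empty bins give NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        gamma = sums / (2.0 * counts)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, gamma, counts

# Toy data (simulated, not real): a smooth surface plus noise on the unit square,
# so that nearby locations have similar values.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1.0, size=(300, 2))
values = coords[:, 0] + coords[:, 1] + rng.normal(0.0, 0.1, size=300)

centers, gamma, counts = empirical_variogram(coords, values, n_bins=10)
for h, g, m in zip(centers, gamma, counts):
    print(f"lag {h:.3f}  semivariance {g:.4f}  pairs {m}")
```

Plotting `gamma` against `centers` gives the usual variogram figure. Semivariance that is small at short lags and grows with distance suggests spatial dependence, whereas a flat curve suggests little or none.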

How do you imagine spatial dependence may play a role between the variables in the dataset?
What are some research questions that these data could help answer (at least one for each dataset)?
Write an intuitive directed acyclic graphical model outlining the important variables, the parameters, and how they could be related to each other (no need for other assumptions – keep it simple)
What are the potential results that you may anticipate?
What are some potential pitfalls or shortcomings of your model?
What additional data could be useful for the purpose of your analysis?

Keep it simple!
You can start thinking about your final project format now. Slides/poster
You can submit your answers as a slide set or in poster format
If you choose poster: OK to leave lots of empty space now, or to cut later
Submit your code too!
Preferred format: a .qmd file that compiles into a .pdf document. Submit both in a .zip archive

You can use the purpleair.csv dataset for point-referenced data
Warning! If multiple groups use the same data, I will have to evaluate them by comparison

Specifics of this dataset:

id is the sensor id for this location
location_type indicates whether the sensor is placed outdoors
Region of interest: all sensors with 30 < lat < 50 and -125 < lon < -115
Original data has 2-minute frequency. This dataset has daily summaries
Variables in the dataset are humidity (relative, in %), temp (in F), pm25a and pm25b (particulate matter 2.5 micron), pm01 (particulate matter 1 micron) and pm10 (particulate matter 10 micron)
pm25b has a different unit of measurement (use pm25a instead)
Daily summaries are _low (5th percentile), _quart1 (25th percentile), _median, _mean, _quart3 (75th percentile), _high (95th percentile)
Example: if pm25a_median = 20 then half the measurements at this location during the day were over 20 $\mu g/m^3$
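The region filter and the summary-column naming described above could be handled along these lines in Python with pandas. This is only a sketch on a small made-up table: the coordinate column names `lat` and `lon`, the underscore in the `_high` suffix, and every value in the toy data are assumptions, so check them against the actual header of purpleair.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real file; in practice: df = pd.read_csv("purpleair.csv").
# Column names `lat`/`lon` and all values below are assumptions for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "lat": [37.8, 34.1, 47.6, 29.5, 40.0],
    "lon": [-122.3, -118.2, -122.3, -120.0, -110.0],
    "pm25a_median": [12.0, np.nan, 8.5, 20.0, 15.0],
    "pm25a_mean": [13.1, 18.2, 9.0, 22.4, 16.3],
})

# Region of interest: 30 < lat < 50 and -125 < lon < -115 (strict inequalities).
in_region = (
    (df["lat"] > 30) & (df["lat"] < 50)
    & (df["lon"] > -125) & (df["lon"] < -115)
)
roi = df.loc[in_region]

# Daily-summary columns are the variable name plus a suffix, e.g. pm25a_median.
# `_high` is assumed to follow the same underscore pattern as the others.
suffixes = ["_low", "_quart1", "_median", "_mean", "_quart3", "_high"]
pm25a_cols = ["pm25a" + s for s in suffixes]
present = [c for c in pm25a_cols if c in roi.columns]

# Missing values per column, then basic non-spatial summary statistics.
missing = roi[present].isna().sum()
summary = roi[present].describe()

print(roi[["id", "lat", "lon"]])
print(missing)
print(summary)
```

Here `present` keeps only the pm25a summaries that actually exist in the table, so the same lines work whether or not all six summaries are included in the file.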