Group challenge 1

Spatial Statistics - BIOSTAT 696/896

Michele Peruzzi

University of Michigan

Groups?

  • Have some groups been formed?
  • Troubles?
  • I have enabled Piazza if you need to find group partners
  • You could e.g. post your background (MS/PhD/what year…) and look for complementary group members

Getting started

  • This group challenge is about getting things started with point-referenced and areal data
  • The goal of this group activity is to make the first steps towards your final group project
  • Some of the points I include can only be completed if you have your own data
  • For these reasons, although not required, I very highly recommend that you begin searching for data for your final project
  • Because it is still early, you should search for a point-referenced dataset AND an areal dataset

Point-referenced data requirements

  • Spatial coordinates (latitude and longitude, or equivalent depending on the domain)
  • Not necessarily environmental/ecology/satellite
  • Must have a continuous variable that varies in space
  • If you do preprocessing (e.g., data transformations), you must document everything
  • Check in with me before submitting your answers

Areal data requirements

  • Higher resolution than state-level (US) or country-level (world); e.g., county-level works
  • Areal data are much easier to find, so I will be pickier about your research question (it must be “interesting”, subjectively)
  • Check in with me before submitting your answers

Checklist for both datasets, part 1

  • Source of the data. Link, citation.
  • Overall data description (number of observations, number of variables)
  • Are there missing data? Is there a pattern or an intuitive reason for the missingness?
  • What is in the dataset? Describe each column. Summary statistics. Correlation analysis (non-spatial is ok). Preliminary data analysis using methods that you know
  • Make summary figures based on non-spatial descriptive statistics
  • Describe the spatial component of the data: what is the spatial domain, its dimension, how are observations indexed in space? Do you think modeling covariance as decaying with distance is appropriate?
  • For the point-referenced dataset: visualize the empirical variogram for the variable of interest. Does the variogram suggest spatial dependence in the data?
  • Map the data. Map the spatial variables, and write short but meaningful captions for your figures.
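
As a starting point for the variogram item above, here is a minimal sketch of the method-of-moments empirical semivariogram in plain Python. It is an illustration under assumptions, not a required implementation: the function name `empirical_variogram`, the binning choices, and the simulated toy data are all hypothetical, and the coordinates are assumed to be planar (projected), so Euclidean distance makes sense. With latitude/longitude you would first project the coordinates or use great-circle distances.

```python
import math
import random


def empirical_variogram(coords, values, n_bins=10, max_dist=None):
    """Method-of-moments semivariogram: average of 0.5 * (z_i - z_j)^2 per distance bin.

    coords: list of (x, y) pairs in planar units; values: list of floats.
    Returns a list of (bin_center, semivariance, n_pairs) for non-empty bins.
    """
    n = len(values)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            pairs.append((d, 0.5 * (values[i] - values[j]) ** 2))
    if max_dist is None:
        # Assumption: stop at half the largest distance, a common rule of thumb.
        max_dist = 0.5 * max(d for d, _ in pairs)
    width = max_dist / n_bins
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for d, g in pairs:
        if d < max_dist:
            k = min(int(d / width), n_bins - 1)  # guard against rounding at the edge
            sums[k] += g
            counts[k] += 1
    return [((k + 0.5) * width, sums[k] / counts[k], counts[k])
            for k in range(n_bins) if counts[k] > 0]


# Toy data (simulated, not from any real dataset): a smooth surface plus noise.
random.seed(696)
coords = [(random.random(), random.random()) for _ in range(150)]
values = [math.sin(3 * x) + math.cos(3 * y) + random.gauss(0, 0.1) for x, y in coords]

vg = empirical_variogram(coords, values)
for h, gamma, n_pairs in vg:
    print(f"h={h:.3f}  gamma={gamma:.4f}  pairs={n_pairs}")
```

If the semivariance grows with distance and then levels off, that is consistent with spatial dependence; a flat variogram suggests little spatial structure at the scales examined.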

Checklist for both datasets, part 2

  • How do you imagine spatial dependence may play a role between the variables in the dataset?
  • What are some research questions that these data could help answer (at least one for each dataset)?
  • Write an intuitive directed acyclic graphical model outlining the important variables, the parameters, and how they could be related to each other (no need for other assumptions – keep it simple)
  • What are the potential results that you may anticipate?
  • What are some potential pitfalls or shortcomings of your model?
  • What additional data could be useful for the purpose of your analysis?
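
One way to sanity-check a proposed graphical model is to write it down as a parent list and confirm that it really is acyclic. The sketch below does this with Python's standard `graphlib` module. The graph itself is purely hypothetical: the nodes `beta`, `tau2`, `sigma2`, `phi`, and the latent spatial effect `w` are illustrative names, and the arrows are an assumed structure, not a recommended model.

```python
from graphlib import CycleError, TopologicalSorter


def topological_order(parents):
    """Return nodes ordered so parents come before children, or None if there is a cycle."""
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError:
        return None


# Hypothetical DAG: each key maps to the list of its parents.
dag = {
    "sigma2": [], "phi": [],          # assumed covariance parameters
    "beta": [], "tau2": [],           # assumed regression coefficients and noise variance
    "temp": [], "humidity": [],       # covariates
    "w": ["sigma2", "phi"],           # latent spatial effect
    "pm25a": ["temp", "humidity", "beta", "w", "tau2"],  # outcome
}

order = topological_order(dag)
print(order)
```

A return value of `None` means an arrow closes a loop somewhere, so the diagram is not a valid DAG and needs to be revised.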

Tips

  • Keep it simple!
  • You can start thinking about your final project format now (slides or poster)
  • You can submit your answers as a slide set or in poster format
  • If you choose poster: OK to leave lots of empty space now, or to cut later
  • Submit your code too!
  • Preferred format: a .qmd file that compiles into a .pdf document. Submit both in a .zip archive

Can’t find data?

  • You can use the purpleair.csv dataset for point-referenced data
  • You can use the heart_disease.csv dataset for areal data

heart_disease.csv information

purpleair.csv information

  • This is a completely new unpublished dataset, never used before
  • Collected with a Python web-scraping script written to download sensor data from Purple Air

Specifics of this dataset:

  • id is the sensor id for this location
  • location_type indicates whether the sensor is placed outdoors
  • Region of interest: all sensors with 30 < lat < 50 and -125 < lon < -115
  • Original data has 2-minute frequency. This dataset has daily summaries
  • Variables in the dataset are humidity (relative, in %), temp (in F), pm25a and pm25b (particulate matter 2.5 micron), pm01 (particulate matter 1 micron) and pm10 (particulate matter 10 micron)
  • pm25b has a different unit of measurement (use pm25a instead)
  • Daily summaries are _low (5th percentile), _quart1 (25th percentile), _median, _mean, _quart3 (75th percentile), _high (95th percentile)
  • Example: if pm25a_median=20, then half of the measurements at this location during the day were above 20 µg/m³
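
To make the daily summaries concrete, the sketch below builds them from one simulated day of 2-minute readings (720 values) and applies the region-of-interest filter. This is not the script used to create the dataset: the function names are hypothetical, the readings are simulated, the percentile interpolation rule (`method="inclusive"`) is an assumption, and the 95th-percentile column is assumed to follow the same underscore pattern as the others (`_high`).

```python
import random
import statistics


def in_region(lat, lon):
    """Region of interest stated above: 30 < lat < 50 and -125 < lon < -115."""
    return 30 < lat < 50 and -125 < lon < -115


def daily_summary(readings, prefix):
    """Summarise one day's readings using the dataset's column suffixes."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99,
    # so index p - 1 holds the p-th percentile.
    pct = statistics.quantiles(readings, n=100, method="inclusive")
    return {
        f"{prefix}_low": pct[4],       # 5th percentile
        f"{prefix}_quart1": pct[24],   # 25th percentile
        f"{prefix}_median": statistics.median(readings),
        f"{prefix}_mean": statistics.fmean(readings),
        f"{prefix}_quart3": pct[74],   # 75th percentile
        f"{prefix}_high": pct[94],     # 95th percentile (assumed suffix)
    }


# Simulated day: one reading every 2 minutes gives 24 * 30 = 720 readings.
random.seed(896)
readings = [random.lognormvariate(3.0, 0.4) for _ in range(24 * 30)]

summary = daily_summary(readings, "pm25a")
for name, value in summary.items():
    print(f"{name}: {value:.2f}")

# Half of the day's readings lie above the median, as in the example above.
n_above = sum(r > summary["pm25a_median"] for r in readings)
print(n_above, "of", len(readings), "readings are above the median")
```

The same function applies to the other variables by changing the prefix, e.g. `daily_summary(temps, "temp")` for a hypothetical list of temperature readings.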