Technical Report: Utah Crime Incident Rate Analysis
Motivating Question
Do crime incident rates in Utah vary by season, city, or year?
This question has practical implications for resource allocation by law enforcement agencies. If crime rates spike in particular seasons, targeted seasonal interventions could be justified. If rates differ primarily by city or year, the focus should shift to city-level policy and long-term trend analysis.
Dataset
Sources
- Crime data: 24 city-level police report datasets downloaded from the Utah Open Data Portal, covering cities including Salt Lake City, Orem, Logan, Provo, South Jordan, Lehi, Roy, and others (2007–2019).
- Population data: U.S. Census Bureau population estimates accessed via the Census API at the place and county level in Utah.
Assembly
The raw data was assembled in a fully reproducible manner using Python scripts:
- scrapping/utahOpenportal.py: programmatically downloads all 24 crime CSVs from the Utah Open Data Portal.
- scrapping/scrape_population.py: queries the Census API for annual population estimates.
The datasets use inconsistent schemas, column names, date formats, and city name spellings — requiring substantial cleaning before analysis.
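As a rough illustration of the population step, the sketch below builds a Census API query URL and converts the response into records. The endpoint, the variable names (NAME, POP), and the helper names are assumptions made for this example and may not match what scrape_population.py actually does; the sample payload is a placeholder, not real data. The one firm detail is the response shape: the Census API returns a JSON array of arrays whose first row is the header.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint and variables; check the Census API documentation for the
# vintage you need. Utah's state FIPS code is 49.
BASE_URL = "https://api.census.gov/data/2019/pep/population"


def build_query_url(base_url=BASE_URL, state_fips="49"):
    """Build a query for all places in one state (hypothetical helper)."""
    params = {"get": "NAME,POP", "for": "place:*", "in": f"state:{state_fips}"}
    # Keep ':', '*' and ',' unescaped so the URL stays readable.
    return f"{base_url}?{urlencode(params, safe=':*,')}"


def rows_to_records(payload):
    """Convert a header-plus-rows payload into a list of dicts."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


# Placeholder payload in the header-plus-rows shape (values are made up).
sample = json.loads(
    '[["NAME","POP","state","place"],'
    '["Example city, Utah","12345","49","00000"]]'
)
records = rows_to_records(sample)
url = build_query_url()
```

Values arrive as strings, so a real pipeline would cast the population column to an integer before merging it with crime counts.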
The final crime_summary.csv dataset contains 514 records, each representing a unique city × season × year combination, with a computed incident rate per 100,000 residents.
Methodology
Step 1 — Data Cleaning
Five cleaning functions in the cleaning package standardize all datasets into a common schema:
| Function | Purpose |
|---|---|
| clean_city_datasets() | Standardizes the 21 standard-format city CSVs |
| clean_assult_dataset() | Handles the non-standard SLC assault format |
| clean_slc_datasets() | Handles the SLC police agency format |
| add_season() | Derives season (Spring/Summer/Fall/Winter) from date |
| clean_city_column() | Resolves city name aliases and normalizes formatting |
All datasets are concatenated into master_crime_data.csv.
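The last two cleaning steps can be sketched in a few lines. This is a minimal illustration, not the code in the cleaning package: the month-to-season mapping assumes meteorological seasons (December through February is Winter), the alias table is hypothetical, and the function names season_of and normalize_city are invented for the example.

```python
from datetime import date

# Assumption: meteorological seasons. The project's add_season() may use
# different boundaries.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Hypothetical alias table; the real list of aliases is not given in the report.
CITY_ALIASES = {
    "slc": "Salt Lake City",
    "salt lake": "Salt Lake City",
}


def season_of(d: date) -> str:
    """Return the season label for a date."""
    return SEASON_BY_MONTH[d.month]


def normalize_city(raw: str) -> str:
    """Trim, collapse whitespace, resolve aliases, and title-case a city name."""
    key = " ".join(raw.split()).lower()
    return CITY_ALIASES.get(key, key.title())
```

For example, normalize_city("  SLC ") and normalize_city("salt lake") both resolve to the same label, so the later group-by does not split one city into several.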
Step 2 — Incident Rate Calculation
Crime counts are aggregated by city × season × year, then merged with the Census population data. The incident rate per 100,000 residents is calculated as:
incident_rate = (crime_count / population) × 100,000
Only city-year pairs with matching population data are retained, yielding the crime_summary.csv dataset used in the analysis.
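The aggregation and merge described above can be sketched with pandas as follows. Column names other than incident_rate_per_100k are assumptions, the function name summarize is hypothetical, and the toy rows are placeholders rather than real counts or populations.

```python
import pandas as pd


def summarize(incidents: pd.DataFrame, population: pd.DataFrame) -> pd.DataFrame:
    """Count incidents per city x season x year and convert to a rate per 100,000."""
    counts = (
        incidents.groupby(["city", "season", "year"])
        .size()
        .reset_index(name="crime_count")
    )
    # The inner join drops city-year pairs that have no population estimate.
    merged = counts.merge(population, on=["city", "year"], how="inner")
    merged["incident_rate_per_100k"] = (
        merged["crime_count"] / merged["population"] * 100_000
    )
    return merged


# Toy input: one row per incident (placeholder values only).
incidents = pd.DataFrame({
    "city": ["City A", "City A", "City A", "City B"],
    "season": ["Winter", "Winter", "Summer", "Winter"],
    "year": [2015, 2015, 2015, 2015],
})
# City B has no population row, so it is dropped by the merge.
population = pd.DataFrame({"city": ["City A"], "year": [2015], "population": [100_000]})

summary = summarize(incidents, population)
```

An inner join is one way to implement "only city-year pairs with matching population data are retained"; a left join with indicator=True would instead make the dropped pairs visible for auditing.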
Step 3 — Statistical Analysis
A factorial ANOVA was used to test whether city, year, and season are significant predictors of the incident rate. The model includes season, city, and their interaction, with year as an additive factor. We fit two models, one on the raw scale and one on a log-transformed scale, and compared them using AIC/BIC and residual diagnostics.
Exploratory Analysis
Before fitting any model, a box plot of incident rates by season reveals that the distributions are remarkably consistent across all four seasons. Each season exhibits some outliers, but the medians and interquartile ranges are nearly identical. This is an early visual signal that seasonal differences in incident rates are minimal.
Original ANOVA Model
Model:
incident_rate_per_100k ~ C(season) * C(city) + C(year)
| Predictor | Sum Sq | df | F | p-value |
|---|---|---|---|---|
| Intercept | 3.83e+07 | 1 | 3.68 | 0.0559 |
| C(season) | 1.80e+05 | 3 | 0.006 | 0.9994 |
| C(city) | 7.78e+09 | 17 | 43.85 | 6.49e-83 |
| C(year) | 6.25e+08 | 9 | 6.66 | 6.04e-09 |
| C(season):C(city) | 3.65e+08 | 51 | 0.69 | 0.9505 |
| Residual | 4.52e+09 | 433 | — | — |
The ANOVA model indicates that incident rates per 100,000 vary significantly by city and year, but not by season. City has a very strong effect (p ≈ 6.49e-83), showing large differences in incident rates across locations, and year is also statistically significant (p ≈ 6.04e-09), indicating that incident rates change over time. In contrast, season shows no detectable effect (p ≈ 0.999). The interaction between season and city is also not significant (p ≈ 0.95), so there is no evidence that seasonal patterns differ across cities.
Residual Diagnostics (Original Model)
A key issue with the original model is visible in its residuals, which do not satisfy the ANOVA assumptions on the raw scale. Incident rates are skewed and span a wide range across cities, so the residuals show heteroskedasticity (unequal variance across groups) and depart from normality. The fit is driven heavily by the large city effects, and the variance does not stabilize without a transformation.
Log-Transformed ANOVA Model
To address the residual issues, a log transformation was applied:
Model:
log(1 + incident_rate_per_100k) ~ C(season) * C(city) + C(year)
| Predictor | Sum Sq | df | F | p-value |
|---|---|---|---|---|
| Intercept | 36.85 | 1 | 14.77 | 1.40e-04 |
| C(season) | 11.07 | 3 | 1.48 | 0.2195 |
| C(city) | 580.24 | 17 | 13.68 | 4.17e-31 |
| C(year) | 110.93 | 9 | 4.94 | 2.49e-06 |
| C(season):C(city) | 101.86 | 51 | 0.80 | 0.8355 |
| Residual | 1080.52 | 433 | — | — |
The log-transformed results confirm the same pattern: incident rates are significantly influenced by both city and year, while season does not have a statistically significant effect. City shows a strong effect (p ≈ 4.17e-31) and year is also significant (p ≈ 2.49e-06). Season remains not significant (p = 0.219), and the season × city interaction is not significant (p = 0.835).
Residual Diagnostics (Log Model)
The log transformation substantially improves residual behavior. The Q-Q plot shows points that track the 45° reference line much more closely, and the residual histogram is closer to a normal distribution than in the original model. This supports using the log scale for inference.
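The visual Q-Q comparison can be complemented with a number. The sketch below uses scipy.stats.probplot, which returns the correlation between the ordered residuals and the theoretical normal quantiles; values closer to 1 mean the points lie closer to a straight line. The data are synthetic and the residuals come from an intercept-only fit, which is a simplification of the models in this report.

```python
import numpy as np
from scipy import stats


def qq_correlation(residuals) -> float:
    """Correlation of the normal Q-Q plot for a set of residuals."""
    (_, _), (_, _, r) = stats.probplot(residuals, dist="norm")
    return float(r)


# Synthetic skewed response (not real data).
rng = np.random.default_rng(1)
y = rng.lognormal(mean=6.0, sigma=1.0, size=500)

# Residuals from an intercept-only model on each scale.
raw_resid = y - y.mean()
log_y = np.log1p(y)
log_resid = log_y - log_y.mean()

r_raw = qq_correlation(raw_resid)
r_log = qq_correlation(log_resid)
print(f"raw scale: {r_raw:.3f}, log scale: {r_log:.3f}")
```

With fitted statsmodels results, the same function can be applied to the resid attribute of each model.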
Model Comparison
| | Original Model | Log Model |
|---|---|---|
| AIC | 9,839.03 | 2,002.56 |
| BIC | 10,182.65 | 2,346.18 |
Both AIC and BIC are far lower after the log transformation, a drop of roughly 7,800 points. This gap should be read with caution: the two models have different response variables, so their likelihoods are on different scales, and the criteria are not directly comparable unless the log model's likelihood is adjusted by the Jacobian of the transformation. The stronger evidence for the log model is diagnostic: the original model suffered from non-constant variance and poorly behaved residuals, while the log transformation stabilized variance and improved normality.
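When comparing information criteria across a transformed response, a common adjustment is to put the log model's likelihood back on the original scale. For z = log(1 + y), the density of y picks up a factor 1 / (1 + y), so the log-likelihood changes by minus the sum of log(1 + y_i). The sketch below shows that adjustment and a consistency check on the reported table. It assumes the parameter count used for AIC/BIC is the number of regression coefficients (1 + 3 + 17 + 9 + 51 = 81), as the reported numbers suggest; the function names are invented for this example. The adjusted AIC itself cannot be reproduced here because it requires the raw incident rates.

```python
import math


def aic_on_original_scale(aic_log_model: float, y) -> float:
    """Adjust the AIC of a log(1 + y) model so it is comparable to a raw-scale model."""
    # log f_Y(y) = log f_Z(log(1 + y)) - log(1 + y)
    log_jacobian = -sum(math.log1p(v) for v in y)
    # AIC = 2k - 2 logL, so adding log_jacobian to logL subtracts 2 * log_jacobian.
    return aic_log_model - 2.0 * log_jacobian


def bic_minus_aic(k: int, n: int) -> float:
    """BIC - AIC = k * (ln n - 2) for a model with k parameters and n observations."""
    return k * (math.log(n) - 2.0)


# Consistency check: 81 coefficients and 514 records give the gap seen in both
# columns of the table (10,182.65 - 9,839.03 and 2,346.18 - 2,002.56).
gap = bic_minus_aic(81, 514)
print(round(gap, 2))
```

The same Jacobian term applies to BIC, since both criteria share the -2 logL component.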
Importantly, despite the large improvement in model fit, the overall conclusions are consistent across both models: city and year are significant predictors of incident rates, while season and the season × city interaction are not statistically significant.
Key Findings
| Predictor | Significant? | p-value (log model) |
|---|---|---|
| City | Yes | 4.17e-31 |
| Year | Yes | 2.49e-06 |
| Season | No | 0.219 |
- City is the strongest predictor: crime rates differ substantially between municipalities. Cities with larger or more transient populations (e.g., Salt Lake City) show higher raw incident counts, but after normalizing by population, patterns vary.
- Year is significant: incident rates show temporal trends, consistent with broader national crime trends.
- Season has no significant effect: despite common assumptions about seasonal crime variation, the data do not support a seasonal pattern in Utah across these years. This finding holds in both the raw and log-transformed models.
Workflow Summary (Reproducibility)
1. python scrapping/utahOpenportal.py # download raw CSVs
2. python scrapping/scrape_population.py # download Census population data
3. python cleaning/cleaning_data.py # → cleaned_data/master_crime_data.csv
4. python cleaning/clean_analysis_data.py # → cleaned_data/crime_summary.csv
5. jupyter notebook Analysis/incident_rate_model.ipynb # ANOVA analysis
6. python run.py # interactive heatmap visualization
7. streamlit run streamlit_app.py # Streamlit dashboard
Limitations
- Missing geographic coverage: Not all Utah cities have open data available. The analysis covers approximately 20 municipalities, which may not be representative of the full state.
- Inconsistent reporting periods: Some datasets span fewer years than others, leading to unbalanced panel data.
- Crime category harmonization: Incident type labels varied significantly across datasets; normalization introduces some classification uncertainty.
- Census population interpolation: Annual population estimates are used; actual populations may vary within years.
Tools Used
- Python (pandas, numpy, statsmodels, matplotlib, seaborn, plotly, streamlit)
- JavaScript (Leaflet.js) for interactive web heatmap
- U.S. Census API for population data
- Utah Open Data Portal for crime incident data