A major part of our work at the Boston Area Research Initiative is making sense of the large, complex data sets generated by administrative processes (resources that many refer to as “big data”) in order to better understand the city. One aspect of this is developing new ways to measure and track the characteristics of neighborhoods that are accessible and interpretable for a wide audience. Following a methodological tradition in sociology, we call these measures ecometrics: literally, measures (-metrics) that describe a space (eco-). The updated ecometrics on crime and social disorder derived from 911 dispatches that we have released this week on our Boston Data Portal are a prime example of this work.
As part of this release, we have also for the first time made available a new data set: frequencies for all case types for all census tracts. This unique resource enables others to generate their own ecometrics from the data, which is exactly why we published it. This blog post walks through how we developed our ecometrics for crime and social disorder as a guide that others might follow to develop new measures that are of particular interest to them.
A number of years ago, Rob Sampson, Chris Winship, and I published guidelines for extracting ecometrics (or, really, any other type of measure) from “naturally-occurring data.” This includes administrative data, such as 911 dispatches, but is also applicable to data gathered from social media and internet platforms; essentially, anything that was generated as part of a business or administrative process rather than collected for the explicit purpose of research. We argued that all such efforts should consist of three parts: (1) isolating the content that you want to measure from irrelevant information; (2) validating whether the measure you are creating actually reflects the ground truth you want it to; and (3) establishing reliability, or how often and at what spatial scale you can generate such measures and have faith in the values (i.e., that they are robust to statistical chance).
In this post, I will focus on the first of these: isolating content. The basic challenge is that the 911 dispatches feature hundreds of case types (244 in 2018, to be precise), reflecting dozens of aspects of neighborhood context. To measure something specific, we need to distinguish those case types that are reflective of it, and discard those that are not. Meanwhile, issues two and three are largely accounted for. There is evidence that certain neighborhoods over- and under-report crimes and disorder, but this does not seem to be terribly systematic, and similar data have been fruitfully employed in research and practice, particularly in criminology. In terms of reliability, the volume of 911 dispatches is sufficient to support measures for time periods as small as 2-6 months for census tracts; see the documentation accompanying the release of the 911 data for more.
When we sought to measure aspects of crime and social disorder from 911 dispatches, we did so in three steps. First, we identified case types that appeared to be a reflection of either crime or social disorder. Second, we analyzed their correlations across census tracts via exploratory factor analysis. Factor analysis enabled us to organize this collection of cases into groupings that better reflected the multiple dimensions of crime and social disorder. Third, we modified the categorizations based on this factor analysis to better reflect similarities between events, and tested how well this fit the data with a confirmatory factor analysis (using structural equation models).
Here I will replicate this process. The reader should note, however, that it is not going to produce exactly the same numerical results as are reported in the documentation. The original ecometrics were constructed using census block groups in 2011. In order to make this post most relevant, I use here census tracts (the geographic scale above census block groups, and the level at which we make counts for all case types available) for 2018, the most recent data. Consequently, these results are for illustration purposes only. First, we bring in the necessary data sets (the tract data set is sourced from BARI’s Geographical Infrastructure).
# Counts of each 911 case type by census tract and year
tract_types<-read.csv('C:/Users/bariuser4/Downloads/911 Call Type Frequency by CT and Year.csv')
# Keep the tract ID plus only the 2018 columns
tract_types<-tract_types[c('CT_ID_10',names(tract_types)[grepl('2018',names(tract_types))])]
names(tract_types)
# Descriptions and category flags for each case type
types<-read.csv('C:/Users/bariuser4/Downloads/911 Call Type Description 2014-19.csv')
names(types)
# Number of case types flagged for at least one of the four categories
nrow(types[types$SocDis + types$PrivateConflict + types$Violence + types$Guns > 0,1:3])
# Tract-level attributes from BARI's Geographical Infrastructure
tracts_geo<-read.csv('C:/Users/bariuser4/Documents/Dan/INC0511841/Documents/Research/Boston-Radcliffe/Geographical Infrastructure/Geographical Infrastructure v. 2014_ Final Folder/Tracts/Tracts_Boston_2010_BARI.csv')
names(tracts_geo)
Part I. Isolating Case Types
We identified 20 case types that specifically reflected crime and social disorder, like “shooting” or “drunken disturbance”; we in turn set aside the 224 other case types from 2018 that were not relevant, such as “motor vehicle accident.” The 20 selected case types are listed below, drawn from the Call Type Description 2014-19 file included in the release.
types[types$SocDis + types$PrivateConflict + types$Violence + types$Guns > 0,1:3]
##     TYPE          tycod  typ_eng
## 1   AB===>>>      AB     ASSAULT AND BATTERY
## 9   ABDWIP        AB     ASSAULT AND BATTERY
## 11  ABIP          AB     ASSAULT AND BATTERY
## 12  ABRPT         AB     ASSAULT AND BATTERY
## 28  ARMROBDEFAULT ARMROB ARMED ROBBERY
## 35  BEIP          BE     BREAKING AND ENTERING
## 86  DISTRBDRUNKS  DISTRB DISTURBANCE
## 90  DISTRBPANHAN  DISTRB DISTURBANCE
## 95  EDP2          EDP    EMOTIONALLY DISTURBED PERSON
## 101 FDWEAPGUN     FDWEAP FOUND WEAPON
## 104 FIGHTDEFAULT  FIGHT  FIGHT
## 160 IVPERLEWD     IVPER  INVESTIGATE PERSON
## 191 LANTENDEFAULT LANTEN LANDLORD TENANT ISSUE
## 230 PERGUNDEFAULT PERGUN PERSON WITH A GUN
## 231 PKNIFEDEFAULT PKNIFE PERSON WITH A KNIFE
## 243 PSHOTDEFAULT  PSHOT  PERSON SHOT (P) (E)
## 258 SHOTSDEFAULT  SHOTS  SHOTS FIRED
## 299 VANDIP        VAND   VANDALISM
## 300 VANDRPT       VAND   VANDALISM
## 302 VIORDRDEFAULT VIORDR VIOLATION OF THE RESTRAINING ORDER
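types$TYPE<-as.character(types$TYPE)
# read.csv() converts the column name 'AB===>>>_2018' to 'AB......_2018', so match that here
types$TYPE[1]<-'AB......'
# Keep the tract ID and the 2018 counts for the 20 selected case types
test<-tract_types[c('CT_ID_10',paste(types$TYPE[types$SocDis + types$PrivateConflict + types$Violence + types$Guns > 0],'2018',sep='_'))]
# Add the tract attributes in columns 4, 7, and 8 of tracts_geo, which supply POP100 and Res used below
test<-merge(test,tracts_geo[c(4,7,8)],by='CT_ID_10')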
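The selection and renaming steps above can be sketched on a toy table. This is a minimal illustration rather than the release files: the three-row `toy_types` table, its flag values, and the `MVADEFAULT` code are invented, and only the column names (`TYPE`, `SocDis`, `PrivateConflict`, `Violence`, `Guns`) follow the code above. It also shows why `AB===>>>` needs special handling: `read.csv()` passes column names through `make.names()` by default, which replaces characters that are not valid in R names with dots.

```r
# Toy stand-in for the Call Type Description file (all values are invented;
# MVADEFAULT is a hypothetical code for an irrelevant case type).
toy_types <- data.frame(
  TYPE = c("AB===>>>", "SHOTSDEFAULT", "MVADEFAULT"),
  SocDis = c(0, 0, 0),
  PrivateConflict = c(0, 0, 0),
  Violence = c(1, 0, 0),
  Guns = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# A case type is kept if it is flagged in at least one category
flag_cols <- c("SocDis", "PrivateConflict", "Violence", "Guns")
keep <- rowSums(toy_types[flag_cols]) > 0
selected <- toy_types$TYPE[keep]

# Build the column names the way read.csv() would have sanitized them
wanted_cols <- make.names(paste(selected, "2018", sep = "_"))
print(wanted_cols)
```

Applying `make.names()` to the whole vector is one alternative to fixing the first row by hand, and it would also catch any other code containing unusual characters.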
Part II. Factor Analysis
Second, we examined the geographic correlations between these case types through an exploratory factor analysis. Did they form a single dimension, with all cases clustering together? Or did they break out into multiple dimensions, describing different aspects of the neighborhood? In the factor analysis below we see the suggestion of at least three dimensions. Their interpretations are not entirely clear, however. Some case types make sense together, like those related to guns hanging together (Factor 3), but some are a little sloppier.
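# Dispatches per 1,000 residents for each of the 20 case types
test[24:43]<-test[2:21]/test$POP100*1000
names(test)[24:43]<-paste(names(test)[2:21],'rate',sep='_')
# Keep only tracts with more than 250 residents
test<-test[test$POP100>250,]
# Log-transform the rates; adding 1 keeps zero rates defined
test[44:63]<-log(test[24:43]+1)
names(test)[44:63]<-paste(names(test)[2:21],'rate','log',sep='_')
# Three-factor exploratory factor analysis on tracts where Res is 1
fit<-factanal(test[test$Res==1,c(44:63)],3,rotation='varimax')
# Hide loadings below .4 in absolute value and sort variables by factor
print(fit, digits=2, cutoff=.4, sort=TRUE)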
##
## Call:
## factanal(x = test[test$Res == 1, c(44:63)], factors = 3, rotation = "varimax")
##
## Uniquenesses:
## AB......_2018_rate_log        ABDWIP_2018_rate_log
## 0.95                          0.77
## ABIP_2018_rate_log            ABRPT_2018_rate_log
## 0.11                          0.37
## ARMROBDEFAULT_2018_rate_log   BEIP_2018_rate_log
## 0.59                          0.55
## DISTRBDRUNKS_2018_rate_log    DISTRBPANHAN_2018_rate_log
## 0.45                          0.49
## EDP2_2018_rate_log            FDWEAPGUN_2018_rate_log
## 0.50                          0.88
## FIGHTDEFAULT_2018_rate_log    IVPERLEWD_2018_rate_log
## 0.13                          0.29
## LANTENDEFAULT_2018_rate_log   PERGUNDEFAULT_2018_rate_log
## 0.42                          0.23
## PKNIFEDEFAULT_2018_rate_log   PSHOTDEFAULT_2018_rate_log
## 0.20                          0.02
## SHOTSDEFAULT_2018_rate_log    VANDIP_2018_rate_log
## 0.47                          0.45
## VANDRPT_2018_rate_log         VIORDRDEFAULT_2018_rate_log
## 0.38                          0.34
##
## Loadings:
## Factor1 Factor2 Factor3
## ABRPT_2018_rate_log 0.53 0.48
## ARMROBDEFAULT_2018_rate_log 0.57
## BEIP_2018_rate_log 0.59
## FIGHTDEFAULT_2018_rate_log 0.67 0.52
## LANTENDEFAULT_2018_rate_log 0.56 0.47
## PKNIFEDEFAULT_2018_rate_log 0.58 0.42 0.53
## VANDRPT_2018_rate_log 0.54 0.47
## VIORDRDEFAULT_2018_rate_log 0.67 0.45
## ABIP_2018_rate_log 0.56 0.68
## DISTRBDRUNKS_2018_rate_log 0.72
## DISTRBPANHAN_2018_rate_log 0.70
## IVPERLEWD_2018_rate_log 0.82
## PERGUNDEFAULT_2018_rate_log 0.53 0.63
## PSHOTDEFAULT_2018_rate_log 0.96
## SHOTSDEFAULT_2018_rate_log 0.40 0.58
## AB......_2018_rate_log
## ABDWIP_2018_rate_log
## EDP2_2018_rate_log 0.42 0.42
## FDWEAPGUN_2018_rate_log
## VANDIP_2018_rate_log 0.43 0.49
##
##                Factor1 Factor2 Factor3
## SS loadings       4.07    3.81    3.56
## Proportion Var    0.20    0.19    0.18
## Cumulative Var    0.20    0.39    0.57
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 163.82 on 133 degrees of freedom.
## The p-value is 0.0359
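The rate and log-transform steps that feed the factor analysis rely on column positions (24:43, 44:63), which break if the column order changes. A minimal sketch of the same transformation keyed on column names is given here, using an invented three-tract table; `POP100` and the `_2018` naming pattern come from the code above, while the data values are made up for illustration.

```r
# Toy tract table (invented values); POP100 is the population column used above
toy <- data.frame(
  CT_ID_10 = c(1, 2, 3),
  SHOTSDEFAULT_2018 = c(4, 0, 10),
  FIGHTDEFAULT_2018 = c(2, 6, 1),
  POP100 = c(2000, 100, 5000)
)

# Find the count columns by name rather than by position
count_cols <- grep("_2018$", names(toy), value = TRUE)

# Same population threshold as above: keep tracts with more than 250 residents
toy <- toy[toy$POP100 > 250, ]

# Dispatches per 1,000 residents
rate_cols <- paste(count_cols, "rate", sep = "_")
toy[rate_cols] <- toy[count_cols] / toy$POP100 * 1000

# log(x + 1) keeps tracts with a zero rate defined
log_cols <- paste(rate_cols, "log", sep = "_")
toy[log_cols] <- log(toy[rate_cols] + 1)

print(names(toy))
```

With this approach, the input to `factanal()` could be selected as `toy[log_cols]` instead of a numeric column range.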
Part III. Structural Equation Model
Thus, our last step was to modify these categorizations slightly so that they make more substantive sense, and to test the resulting organization using structural equation modeling. The results, as shown below, demonstrate a rather strong fit for this organization of the case types into four factors: public social disorder, private conflict, public violence, and prevalence of guns. We treat each as an aspect of a neighborhood (or a latent factor, in structural equation lingo) with a variety of case types that arise from it (or manifest variables).
library(lavaan)
model<-'
guns =~ ABDWIP_2018_rate_log + PERGUNDEFAULT_2018_rate_log + PSHOTDEFAULT_2018_rate_log + SHOTSDEFAULT_2018_rate_log + FDWEAPGUN_2018_rate_log
soc_disorder =~ DISTRBDRUNKS_2018_rate_log + DISTRBPANHAN_2018_rate_log + IVPERLEWD_2018_rate_log + VANDIP_2018_rate_log
private_conflict =~ BEIP_2018_rate_log + VIORDRDEFAULT_2018_rate_log + LANTENDEFAULT_2018_rate_log + VANDRPT_2018_rate_log
public_viol =~ ARMROBDEFAULT_2018_rate_log + ABIP_2018_rate_log + ABRPT_2018_rate_log + AB......_2018_rate_log + PKNIFEDEFAULT_2018_rate_log + FIGHTDEFAULT_2018_rate_log + EDP2_2018_rate_log
'
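# Fit the four-factor measurement model and report standardized estimates
sem_fit <- sem(model, data=test)
summary(sem_fit,standardized=TRUE)
# Draw the path diagram with standardized estimates as edge labels
library(semPlot)
semPaths(sem_fit, whatLabels='stand',layout='circle2',intercepts=FALSE,residuals=FALSE, nCharNodes=6)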
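To judge how well a model of this kind fits, it can help to pull a few fit indices out directly rather than reading them from the full summary. The sketch below is only an illustration of the `fitMeasures()` call in lavaan: it uses the `HolzingerSwineford1939` data set and the three-factor model from lavaan's own tutorial as a stand-in, not the 911 data, and the choice of indices is an assumption about what a reader might want to report.

```r
# Guarded so the sketch is skipped if lavaan is not installed
if (requireNamespace("lavaan", quietly = TRUE)) {
  # Stand-in model and data bundled with lavaan (not the 911 dispatches)
  hs_model <- '
    visual  =~ x1 + x2 + x3
    textual =~ x4 + x5 + x6
    speed   =~ x7 + x8 + x9
  '
  hs_fit <- lavaan::cfa(hs_model, data = lavaan::HolzingerSwineford1939)

  # Extract a few commonly reported fit indices as a named vector
  fit_stats <- lavaan::fitMeasures(hs_fit, c("cfi", "tli", "rmsea", "srmr"))
  print(fit_stats)
}
```

For the tract model above, the equivalent call would be `fitMeasures(sem_fit, c("cfi", "tli", "rmsea", "srmr"))`.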
In Conclusion
We have built our four main ecometrics for crime and social disorder based on these categorizations. We sum all cases pertaining to these case types and then divide by population to calculate rates. We release them on the Boston Data Library and BostonMap annually and encourage people to use these metrics in their own work wherever useful. This year, however, we also want to encourage people to turn the tract-by-tract case type frequencies into their own novel ecometrics, with this blog post as a guide to one approach to doing so.
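As a starting point for a custom ecometric, the sum-then-divide calculation described above might look like the following sketch. The tract values are invented, the grouping of three gun-related case types is just an example, and the per-1,000 scaling mirrors the code earlier in this post rather than a confirmed detail of the published measures.

```r
# Toy tract-level counts (invented); column names follow the TYPE_YEAR pattern
toy <- data.frame(
  CT_ID_10 = c(1, 2, 3),
  PERGUNDEFAULT_2018 = c(3, 0, 1),
  SHOTSDEFAULT_2018 = c(5, 1, 0),
  PSHOTDEFAULT_2018 = c(2, 0, 0),
  POP100 = c(4000, 2500, 1000)
)

# Case types chosen to represent the construct of interest (example grouping)
gun_types <- c("PERGUNDEFAULT_2018", "SHOTSDEFAULT_2018", "PSHOTDEFAULT_2018")

# Sum dispatches across the chosen case types, then express per 1,000 residents
toy$guns_count <- rowSums(toy[gun_types])
toy$guns_rate <- toy$guns_count / toy$POP100 * 1000

print(toy[c("CT_ID_10", "guns_count", "guns_rate")])
```

Swapping in a different vector of case type columns is all that is needed to define a different measure, though any new grouping should still be checked against the validity and reliability concerns discussed at the start of this post.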