The Good, the Bad, and the Honest in Open Science: Refurbishing the BARI Geographical Infrastructure

This week we are releasing the 2019 update of BARI’s Geographical Infrastructure for the City of Boston (GI). The GI is a critical resource for enabling all of BARI’s research in that it organizes the city into 17 nested levels of geography—from parcels to streets to census geographies to administrative boundaries—providing .csvs and shapefiles for each. Simply put, the GI enables researchers to analyze, visualize, and merge any data set that references any of these geographic levels of organization. Further, it provides a seamless way to aggregate data from one level to any of its containing levels, say, counting 311 reports on a street or calculating the average value of properties in a census tract. Given that this tool is so central to all of our work, we share it with everyone else doing research on Boston through our Boston Data Portal.
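As a rough illustration of that kind of aggregation, here is a minimal base R sketch. The column names `Land_Parcel_ID`, `TLID`, and `CT_ID_10` follow the GI files used later in this post; the toy tables, the `value` column, and the `report_id` column are assumptions made up for the example, not actual GI data.

```r
# Hypothetical parcel table: each parcel carries the IDs of the geographies containing it.
parcels <- data.frame(
  Land_Parcel_ID = c(101, 102, 103, 104, 105),
  TLID           = c(1, 1, 2, 2, 2),                 # street segment
  CT_ID_10       = c("A", "A", "A", "B", "B"),       # census tract
  value          = c(300000, 500000, 400000, 250000, 350000)  # illustrative values
)

# Hypothetical 311 reports, each already tagged with a parcel ID.
reports <- data.frame(
  report_id      = 1:6,
  Land_Parcel_ID = c(101, 101, 102, 103, 105, 105)
)

# Count reports per street segment: look up each report's TLID, then tally.
reports_geo <- merge(reports, parcels[, c("Land_Parcel_ID", "TLID")],
                     by = "Land_Parcel_ID")
reports_per_street <- aggregate(report_id ~ TLID, data = reports_geo, FUN = length)
names(reports_per_street)[2] <- "n_reports"

# Average property value per census tract.
mean_value_per_tract <- aggregate(value ~ CT_ID_10, data = parcels, FUN = mean)

print(reports_per_street)
print(mean_value_per_tract)
```

The same pattern applies to any pair of levels, as long as the lower-level record carries the ID of the containing geography.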

These blog posts typically highlight a particular aspect of the data set that we see as important to its usage, or something new that we added to a particular resource this year. Instead, I want to share an unfortunate discovery: there were non-trivial mistakes in the 2018 release of the Geographical Infrastructure. I also want to explain what this means for those who have used the data and what we have done to fix the problem.

First, the problem. I was using the GI to tabulate crimes on streets, something we do fairly often as part of our work on problem properties, and I discovered an outrageous number of crimes on one street in particular. This led me to wonder how many parcels were on that street:

library(tidyverse)  # includes dplyr and ggplot2
library(sf)
library(ggmap)

# Count the distinct parcels attached to each street segment (TLID)
parcels_18_orig <- read.csv('Land.Parcels.2018.csv')
street_counts_18 <- parcels_18_orig %>%
  group_by(TLID) %>%
  summarise(parcels = n_distinct(Land_Parcel_ID))

# Read the street shapefile and join the parcel counts to it
streets_geo <- st_read(dsn = "C:/Users/bariuser4/Documents/Dan/INC0511841/Documents/Research/Boston-Radcliffe/Geographical Infrastructure/Geographical Infrastructure v. 2014_ Final Folder/Roads/Roads_Boston_2013_BARI.shp")
streets_geo <- merge(streets_geo, street_counts_18, by = 'TLID', all.x = TRUE)

summary(streets_geo$parcels)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
##    1.000    3.000    5.000    8.391   10.000 8224.000    13437

Yes, there was a street with over 8,000 parcels. Let’s zoom in on that street:

# Keep every street in the census tract that contains the outlier segment
outlier_tract <- streets_geo$CT_ID_10[which.max(streets_geo$parcels)]
streets_geo_subset <- streets_geo[!is.na(streets_geo$CT_ID_10) &
                                  streets_geo$CT_ID_10 == outlier_tract, ]

# Fetch a basemap covering the bounding box of that tract
bb <- st_bbox(streets_geo_subset)
tract_zoom <- get_map(location = c(left = bb[["xmin"]], bottom = bb[["ymin"]],
                                   right = bb[["xmax"]], top = bb[["ymax"]]),
                      source = 'stamen')
tract_zoom_map <- ggmap(tract_zoom)

# Overlay the streets, colored by the number of parcels attached to each
tract_zoom_map +
  geom_sf(data = streets_geo_subset, aes(color = parcels), size = 1.5, inherit.aes = FALSE)

By all evidence, this is a pretty nondescript street in Roxbury, and the map suggests it should have far fewer parcels on it. For comparison, the next highest number of parcels on a street was 76. Something was wrong.

We looked closer at the parcels that had been attached to this street and they had all sorts of street names. A lot of them, however, were street names that we knew from previous experience to be problematic for geocoding, including streets with numbers in them (e.g., 5th Street), words that might or might not be abbreviated (e.g., Mount vs. Mt.), and full names of honored individuals or groups (e.g., Veterans of Foreign Wars Pkwy vs. VFW Pkwy).

This gave us a clue to what was going on based on the new process we introduced in the 2018 version for connecting parcels to streets. The City maintains a shapefile of parcels, meaning the latitude and longitude of every parcel is already known. Thus, instead of a classic geocode, which uses the street number and name to estimate where a parcel actually sits, we were able to do something simpler and in fact more precise. We attach each parcel to the nearest street sharing its name.
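The idea of attaching a parcel to the nearest street that shares its name can be sketched in base R as follows. This is an illustration under stated assumptions, not our production code: the helper names `point_segment_dist` and `nearest_same_name`, the toy `streets` table, and the use of straight two-point segments in planar coordinates (meters) are all made up for the example.

```r
# Distance from point (px, py) to the segment (x1, y1)-(x2, y2), planar coordinates.
point_segment_dist <- function(px, py, x1, y1, x2, y2) {
  dx <- x2 - x1
  dy <- y2 - y1
  len2 <- dx^2 + dy^2
  # Relative position of the projection along the segment, clamped to [0, 1]
  t <- if (len2 == 0) 0 else ((px - x1) * dx + (py - y1) * dy) / len2
  t <- max(0, min(1, t))
  sqrt((px - (x1 + t * dx))^2 + (py - (y1 + t * dy))^2)
}

# Hypothetical street segments, each a straight line with a name and a TLID.
streets <- data.frame(
  TLID = c(1, 2, 3),
  name = c("Main St", "Main St", "Oak St"),
  x1 = c(0, 100, 0),   y1 = c(0, 0, 50),
  x2 = c(100, 200, 100), y2 = c(0, 0, 50)
)

# Return the TLID of the nearest segment with the same name, or NA if no name matches.
nearest_same_name <- function(px, py, street_name, streets) {
  candidates <- streets[streets$name == street_name, ]
  if (nrow(candidates) == 0) return(NA_real_)  # never fall back to a default segment
  d <- mapply(function(x1, y1, x2, y2) point_segment_dist(px, py, x1, y1, x2, y2),
              candidates$x1, candidates$y1, candidates$x2, candidates$y2)
  candidates$TLID[which.min(d)]
}

# A "Main St" parcel that sits closer to Oak St still attaches to a Main St segment.
print(nearest_same_name(50, 40, "Main St", streets))
# A parcel whose street name is not in the list stays unattached.
print(nearest_same_name(50, 40, "Elm St", streets))
```

Returning `NA` for an unmatched name, rather than any particular segment, is the design choice that matters here.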

It turns out that a mistake in our code was attaching any parcel whose street name was not present in the list of streets to this single street segment in Roxbury by default. So, we went through these and found about 400 street names that had no match. We then wrote code that adjusted as many of these as possible so that they would match (e.g., “Mount” to “Mt.”). We then fixed our parcel-to-street joining process so that it no longer creates erroneous default matches. Finally, we attached any remaining unmatched parcel to the nearest street segment, provided that street segment was within 50 m; we decided that anything further away was too distant to be considered a defensible match.
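A minimal sketch of the last two steps, name normalization and the distance-capped fallback, might look like the following. The lookup table, the function names, and the choice to treat exactly 50 m as acceptable are assumptions for illustration; the actual code handles many more name variants.

```r
# Hypothetical lookup: spelled-out forms and the abbreviations used in the street list.
abbreviations <- c("Mount" = "Mt.", "Veterans of Foreign Wars" = "VFW")

# Replace whole-word occurrences only, so "Mountain" is left alone.
normalize_name <- function(name, lookup = abbreviations) {
  for (long_form in names(lookup)) {
    pattern <- paste0("\\b", long_form, "\\b")
    name <- gsub(pattern, lookup[[long_form]], name, perl = TRUE)
  }
  name
}

# Choose the final TLID for each parcel:
#   1. keep the name-based match when there is one;
#   2. otherwise use the nearest segment, but only within max_dist_m;
#   3. otherwise leave the parcel unattached (NA).
attach_with_fallback <- function(name_match_tlid, nearest_tlid, nearest_dist_m,
                                 max_dist_m = 50) {
  ifelse(!is.na(name_match_tlid),
         name_match_tlid,
         ifelse(nearest_dist_m <= max_dist_m, nearest_tlid, NA))
}

print(normalize_name(c("Mount Vernon St", "Veterans of Foreign Wars Pkwy", "Mountain Ave")))
print(attach_with_fallback(name_match_tlid = c(10, NA, NA),
                           nearest_tlid    = c(11, 12, 13),
                           nearest_dist_m  = c(5, 30, 80)))
```

In the second call, the first parcel keeps its name-based match, the second falls back to a segment 30 m away, and the third stays unattached because its nearest segment is 80 m away.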

The final product is a new parcel database in which all but 81 parcels (0.08%) are attached to streets and their distribution is much more reasonable:

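# Recompute parcel counts per street segment from the corrected parcel file
parcels_final_new_18_3 <- read.csv('C:/Users/bariuser4/Downloads/Parcel_final_2018_13112019.csv')

street_counts_18 <- parcels_final_new_18_3 %>%
  group_by(TLID) %>%
  summarise(parcels = n_distinct(Land_Parcel_ID))

# Drop the counts from the earlier merge so the corrected counts replace them
streets_geo$parcels <- NULL
streets_geo <- merge(streets_geo, street_counts_18, by = 'TLID', all.x = TRUE)

summary(streets_geo$parcels)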

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
##     1.00     2.00     5.00     7.06     9.00    76.00    11002

ggplot() + geom_sf(data=streets_geo, aes(color=parcels))

The question you are probably asking now: What does this mean? Have people working with the GI been generating incorrect results? The answer is less problematic than it might seem. First, the process of attaching parcels to street segments was independent of every other aspect of the development of the GI. Most importantly, the attribution of parcels to census geographies is done through a separate process, meaning an incorrect street segment has no bearing on the other geographic classifications of a parcel. So, the only concern is for people who conduct analyses at the street segment level. In those cases, the issues are these: the one street in Roxbury that was the default for unmatched parcels was acting as an outlier, and about 2,000 other streets were missing parcels. Some of these streets were originally classified as not having any parcels. Anecdotally, I have re-run the set of regressions on counts of crimes that uncovered this issue, and they came out almost identical after the fix.

We felt it was important to summarize these mistakes for you all, and also saw it as an opportunity to be transparent about the process behind the construction of the GI. We have released the 2019 version as well as an amended 2018 version. We are also cautiously optimistic that this has not affected many people’s research, but please do reach out to us if you are worried or uncertain. After many algorithmic and manual checks (and re-checks), led by BARI team members Alina Ristea (our new postdoc) and Saina Sheini (a PhD Candidate in Public Policy), we are confident that the current version is thoroughly accurate and bid you happy analysis.
