by Will Pfeffer, Riley Tucker, and Edgar Castro
We built the @BARIexplorer Twitter bot last year to showcase, in a new way, some of the open datasets that we make available through the Boston Data Portal. The first version of the bot picked a parcel in Boston at random and tweeted out some information from the Boston Tax Assessment database, along with an image from Google Street View:
We continuously update and add data to the Boston Data Portal, and we’ve continued to update @BARIexplorer with those data. Version 2 added a reply to the initial tweet, this time looking at the availability of public transit near any given parcel: walking time to the nearest MBTA bus or subway stop, along with the number of transit lines that serve the neighborhood in which the parcel is located:
One of the datasets that we update each year is the American Community Survey (ACS), an ongoing survey conducted by the Census Bureau that reaches around 1% of the US population annually (roughly 3.5 million people). By combining these annual samples across five-year periods, the federal government is able to gather very granular, up-to-date information about the country’s population. Just as important, because the ACS surveys only a sample of Americans, the Census Bureau can release highly detailed information about neighborhoods without risk of identifying surveyed individuals. Given these strengths, the ACS makes up a big part of the demographic data we and other researchers use in our work.
Because the ACS provides such a wide range of information, we decided to build in not one, but three new features. We have kept the initial tweet the same, drawing on the Boston Tax Assessment database, but added three new options for the reply. The bot now chooses at random whether to reply with the transit access tweet or with one of the new categories: education levels and age group distribution in the census tract; density, homeownership, and median rent in the census tract; and racial/ethnic heterogeneity (diversity) in the census tract.
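The random choice among reply types can be sketched in a few lines of Python. This is only an illustration: the category labels below are hypothetical names we made up for this post, and we assume each of the four reply types is equally likely, which the description above does not specify.

```python
import random

# Hypothetical labels for the four reply types described above;
# these are illustrative names, not the bot's actual identifiers.
REPLY_CATEGORIES = [
    "transit_access",
    "education_and_age",
    "density_ownership_rent",
    "racial_ethnic_heterogeneity",
]


def choose_reply_category(rng=random):
    """Pick one reply category, assuming all four are equally likely."""
    return rng.choice(REPLY_CATEGORIES)


# Example: choose the reply type for one tweet.
category = choose_reply_category()
print(category)
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is handy when testing.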
To measure education, we have combined a series of education-related variables to create three measures: percentage of neighborhood residents with a high school degree or less, percentage who attended some college or earned a bachelor’s degree, and percentage who have earned a graduate degree (master’s degree, professional degree, or doctorate). Additionally, we have created four variables to measure age distribution within neighborhoods: percentage of residents under the age of 18, percentage between the ages of 18 and 34, percentage between the ages of 35 and 64, and percentage aged 65 and over. To create bar graphs, we load the tabulated data with the pandas Python library and draw the graphs with its DataFrame.plot.bar method. The resulting figure, which is rendered using matplotlib, is then tweaked using various functions included in the matplotlib.pyplot module:
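To make the aggregation step concrete, here is a minimal sketch of how detailed attainment counts could be collapsed into the three education measures. The dictionary keys and the example counts are invented for illustration (they are not ACS variable names or real tract data), and we assume the denominator is simply the sum of the counts supplied. The plotting step is not shown.

```python
# Hypothetical grouping of detailed attainment categories into the
# three measures described above. The key names are illustrative only.
EDUCATION_GROUPS = {
    "high_school_or_less": ["less_than_high_school", "high_school"],
    "some_college_or_bachelors": ["some_college", "bachelors"],
    "graduate_degree": ["masters", "professional", "doctorate"],
}


def education_shares(counts):
    """Return each education group's share of residents, in percent."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for an unpopulated tract.
        return {name: 0.0 for name in EDUCATION_GROUPS}
    return {
        name: 100.0 * sum(counts[key] for key in keys) / total
        for name, keys in EDUCATION_GROUPS.items()
    }


# Made-up counts for a single tract (1,000 residents in total).
tract_counts = {
    "less_than_high_school": 100,
    "high_school": 300,
    "some_college": 200,
    "bachelors": 200,
    "masters": 120,
    "professional": 50,
    "doctorate": 30,
}
shares = education_shares(tract_counts)
print(shares)
```

For these made-up counts the shares come out to 40%, 40%, and 20%. The same pattern would apply to the four age groups.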
Density, homeownership, and median rent are also measured using variables provided by the ACS. Population density is measured as the number of residents per square mile. Because the ACS reports the percentage of residents who rent their home, we measure homeownership as its complement (100% minus the percentage of renters). Median rent is measured using information on gross rent, which represents the value of a renter’s contract rent plus the estimated average monthly cost of utilities. To make graphs for these data, we used the pandas package to read the tabulated data set and join it with tract-level shapefiles obtained from the U.S. Census TIGER/Line database, which were read using the geopandas package. Once the data have been linked to geometries, we then use a combination of plotting routines included in geopandas, basemap functionality provided by contextily, and facilities for drawing arbitrary, geospatially-referenced shapes provided by descartes (which is used by geopandas internally). The resulting figure, showing each geometry colored according to rent, is referred to as a choropleth map:
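The two derived measures above are simple arithmetic, sketched here in Python. The function names and example values are hypothetical, chosen only to illustrate the calculation; the mapping step is not shown.

```python
def population_density(residents, area_sq_miles):
    """Residents per square mile."""
    if area_sq_miles <= 0:
        raise ValueError("area must be positive")
    return residents / area_sq_miles


def homeownership_pct(renter_pct):
    """Homeownership as the complement of the renter percentage."""
    if not 0 <= renter_pct <= 100:
        raise ValueError("renter_pct must be between 0 and 100")
    return 100.0 - renter_pct


# Made-up values for a single tract.
density = population_density(4500, 0.25)  # 4,500 residents on 0.25 sq. mi.
owners = homeownership_pct(64.5)          # 64.5% of residents rent
print(density, owners)
```

With these made-up inputs, the tract has 18,000 residents per square mile and a homeownership rate of 35.5%.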
Finally, we calculate racial/ethnic diversity by considering the proportional demographic representation of different racial groups within the same neighborhood. The ACS provides measures detailing the percentage of residents who self-identify as white, Black, Hispanic, and Asian. Expressing these as proportions between 0 and 1, we have calculated a measure of racial/ethnic heterogeneity based on the Herfindahl index: one minus the sum of the squared group proportions, which represents the probability that two neighborhood residents, selected at random, will belong to different racial/ethnic groups:
dat$EthHet <- 1 - (dat$White^2 + dat$Hispanic^2 + dat$Black^2 + dat$Asian^2)
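For readers who prefer Python, the same calculation can be sketched as follows, with two worked cases. The function name and the example proportions are ours, for illustration only.

```python
def ethnic_heterogeneity(proportions):
    """One minus the sum of squared group proportions (each in [0, 1])."""
    return 1.0 - sum(p ** 2 for p in proportions)


# Four equally sized groups: the most heterogeneous case with four groups.
even_split = ethnic_heterogeneity([0.25, 0.25, 0.25, 0.25])

# A single group makes up the whole neighborhood: no heterogeneity.
single_group = ethnic_heterogeneity([1.0, 0.0, 0.0, 0.0])

print(even_split, single_group)
```

With four groups the measure ranges from 0 (everyone belongs to one group) to 0.75 (all four groups are the same size).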
To generate graphs for these data, we use the same process for generating choropleth maps that was described above, this time coloring by ethnic heterogeneity instead of by rent. This process is generalizable to all kinds of choropleth maps:
We hope that @BARIexplorer is a fun way to learn more about both the city of Boston and a variety of publicly available datasets that describe it! If you have questions, comments, or suggestions, we’d love to hear them; send us an email at BARI@northeastern.edu, or just tweet at the bot! And as always, our code is available on GitHub if you’d like to take a look under the hood.