This page includes datasets that are structured and unstructured, numerical and textual, for a range of methods and disciplines.

Datasets

  • Boston Housing Data – Mapping data specific to Boston and its suburbs.
  • GIS Datasets – Links to a variety of GIS datasets assembled for the Northeastern Library’s GIS subject guide.
  • GIS shapefiles database – Repository of mapping data from a variety of sources and agencies.
  • United States Census Data – Geocoded data from the United States Census Bureau.
  • USGS GIS data – Source of mapping data provided by the United States Geological Survey.

Curated Resources

Text and Data Mining – A subject guide by Amanda Rust on “Text and Data Mining Library Databases” from Northeastern University Library.

Communalytic – A social media data capture tool that allows users to store API keys and download data from a variety of social media websites.

Wikimedia Data Tutorial: Using public data from Wikipedia and its sister projects for academic research – This publicly available tutorial helps individuals understand the technical concepts around working with wiki data, potential pitfalls, and best practices and recommendations from researchers who have been working to increase access to this body of information. 

Scraping Reddit the Right Way: A Guide to Legal and Ethical Data Collection with RedditHarbor – This guide introduces learners to ethical and legal methods for collecting Reddit data.

Collection of NLP Datasets – GitHub repository by Nicolas Iderhoff of free and public domain datasets with text data for use in Natural Language Processing. The texts are primarily raw and unstructured.

Demonstration Corpora, by Alan Liu, including:

  • Abraham Lincoln Speeches and Letters – Corpus assembled by Alan Liu (see website for .zip file download link and metadata).
  • Adult British Fiction – Literature from the 1880s, sorted by author gender (see website for .zip file download link and metadata).
  • American Presidency Project – U.S. presidents’ inaugural speeches, State of the Union addresses, campaign platforms, and other presidential texts.
  • Book Summaries and Film Summaries from Wikipedia – Demo text assembled by David Bamman of the UC Berkeley School of Information (see website for .zip file download link and metadata).
  • Children’s Fiction – Children’s literature from the 1880s, sorted by author gender (see website for .zip file download link and metadata).
  • Feeding America – Michigan State University Library’s collection of historical American cookbooks spanning the late 18th century to early 20th century.
  • Grange Visitor – Michigan State University Library’s collection of The Grange Visitor, the official newspaper of the Michigan State Grange, published between 1875 and 1896.
  • Sunday School Books in 19th Century America – Michigan State University Library’s collection of Sunday school books published between 1809 and 1887.
  • U.S. Patents Related to the Humanities – Patents mentioning ‘humanities’ or ‘liberal arts’ between 1976 and 2015, located through the U.S. Patent Office (see website for .zip file download link and metadata).
  • Writings of William Wordsworth – Writings assembled by Alan Liu (see website for .zip file download link and metadata).

List of sites containing full-text books:

    • Internet Archive Books – Includes plain-text access to books, issues of magazines, etc.
    • Oxford Text Archive – A large number of texts available in a variety of forms, including plain text; texts are accessed one at a time.

Springboard List of Free Datasets for Data Science

  • Bureau of Economic Analysis – National and regional economic data including GDP and exchange rates.
  • Bureau of Labor Statistics – Important economic indicators for the United States, including unemployment and inflation, that can be segmented temporally or spatially.
  • CDC Cause of Death – Database of causes of death provided by the Centers for Disease Control and Prevention.
  • Data is Plural – A weekly newsletter of “useful/curious” datasets. This Google Doc has the name and location of every dataset listed in the newsletter. The datasets cover everything from global foreign aid to Donkey Kong scores.
  • Dow Jones Weekly Returns – Stock price weekly returns from the Dow Jones Industrial Average.
  • Enron Emails – Text data of emails from the fraudulent energy company Enron.
  • FBI Crime Data – Time series crime data reported by the FBI at national and jurisdictional levels.
  • IMF Data – International financial data from the International Monetary Fund.
  • Medicare Hospital Quality – Database on hospital quality of care for hospitals across the United States.
  • SEER Cancer Incidence – Cancer data that can be sorted by gender, race, year, and other demographics.
  • United States Census Data – Statistics provided by the United States Census Bureau.

Corpora from Miriam Posner’s Crowdsourced Document:

A collection of “demo corpora” ready to use for text analysis, along with some related tools, at the “DH Toychest” site curated by Alan Liu.

Visit the “Responsible Datasets in Context” site to explore a variety of resources on the social and historical context of data that is essential for all responsible data work.

The NU Library has a set of “Text and Data Mining” resources that includes a list of “Openly Available Text Sources.”