This page includes datasets that are structured and unstructured, numerical and textual, for a range of methods and disciplines.

Datasets

  • Boston Housing Data – Mapping data specific to Boston and its suburbs.
  • GIS Datasets – Links to a variety of GIS datasets assembled for the Northeastern Library’s GIS subject guide.
  • GIS shapefiles database – Repository of mapping data from a variety of sources and agencies.
  • United States Census Data – Geocoded data from the United States Census Bureau.
  • USGS GIS data – Source of mapping data provided by the United States Geological Survey.

Curated Resources

Text and Data Mining – A subject guide by Amanda Rust on “Text and Data Mining Library Databases” from Northeastern University Library.

Collection of NLP Datasets – GitHub repository by Nicolas Iderhoff of free and public domain datasets with text data for use in Natural Language Processing. Primarily raw, unstructured texts.

Demonstration Corpora, by Alan Liu, including:

  • Abraham Lincoln Speeches and Letters – Corpus assembled by Alan Liu (see website for .zip file download link and metadata).
  • Adult British Fiction – Literature from the 1880s, sorted by author gender (see website for .zip file download link and metadata).
  • American Presidency Project – U.S. Presidents’ Inaugural Speeches, State of the Union addresses, Campaign Platforms, and other presidential text material.
  • Book Summaries and Film Summaries from Wikipedia – Demo text assembled by David Bamman of the UC Berkeley School of Information (see website for .zip file download link and metadata).
  • Children’s Fiction – Children’s literature from the 1880s, sorted by author gender (see website for .zip file download link and metadata).
  • Feeding America – Michigan State University Library’s collection of historical American cookbooks spanning the late 18th century to early 20th century.
  • Grange Visitor – Michigan State University Library’s collection of The Grange Visitor, the official newspaper of the Michigan State Grange, published between 1875 and 1896.
  • Sunday School Books in 19th Century America – Michigan State University Library’s collection of Sunday school books published between 1809 and 1887.
  • U.S. Patents Related to the Humanities – Patents mentioning ‘humanities’ or ‘liberal arts’ between 1976 and 2015, located through the U.S. Patent Office (see website for .zip file download link and metadata).
  • Writings of William Wordsworth – Writings assembled by Alan Liu (see website for .zip file download link and metadata).

List of sites containing full-text books:

    • Internet Archive Books – Includes plain-text access to books, issues of magazines, etc.
    • Oxford Text Archive – A large number of texts available in a variety of forms, including plain text; texts are accessed one at a time.

Springboard List of Free Datasets for Data Science

  • Bureau of Economic Analysis – National and regional economic data including GDP and exchange rates.
  • Bureau of Labor Statistics – Important economic indicators for the United States, including unemployment and inflation, that can be segmented temporally or spatially.
  • CDC Cause of Death – Database of causes of death provided by the Centers for Disease Control and Prevention.
  • Data is Plural – A weekly newsletter of “useful/curious” datasets. This Google Doc lists the name and location of every dataset featured in the newsletter, covering everything from global foreign aid to Donkey Kong scores.
  • Dow Jones Weekly Returns – Stock price weekly returns from the Dow Jones Industrial Average.
  • Enron Emails – Text data of emails from the fraudulent energy company Enron.
  • FBI Crime Data – Time series crime data reported by the FBI at national and jurisdictional levels.
  • IMF Data – International financial data from the International Monetary Fund.
  • Medicare Hospital Quality – Database on hospital quality of care for hospitals across the United States.
  • SEER Cancer Incidence – Cancer data that can be sorted by gender, race, year, and other demographics.
  • United States Census Data – Statistics provided by the United States Census Bureau.

Corpora from Miriam Posner’s Crowdsourced Document:

A collection of “demo corpora” ready to use for text analysis, along with some related tools, at the “DH Toychest” site curated by Alan Liu.

The NU Library has a set of “Text and Data Mining” resources that includes a list of “Openly Available Text Sources.”