Datasets

This page includes datasets that are structured and unstructured, numerical and textual, for a range of methods and disciplines.

Datasets

GIS Datasets

Boston Housing Data – Mapping data specific to Boston and its suburbs.
GIS Datasets – Links to a variety of GIS datasets assembled for the Northeastern Library’s GIS subject guide.
GIS shapefiles database – Repository of mapping data from a variety of sources and agencies.
United States Census Data – Geocoded data from the United States Census Bureau.
USGS GIS data – Source of mapping data provided by the United States Geological Survey.

Network Datasets

Duke Network Analysis Center – Database of network dataset repositories.
Networkrepository.com – Database of network data with interactive data visualization and analytical tools.
SNAP Stanford Large Network Dataset Collection – Collection of network datasets, including data from social media platforms such as Facebook, LiveJournal, and Twitter.
UC Irvine Network Data Repository – Collection of network datasets used in previously published scholarly articles.

Plain Text

Early Caribbean Digital Archive (ECDA) – An open-access collection of pre-twentieth-century Caribbean texts, maps, and images.
Project Gutenberg – A library of over 60,000 eBooks and texts, available as plain text and in other formats.

Political Science and Public Policy Datasets

American Community Survey (ACS) – A premier source for detailed population and housing information about the US.
American National Election Studies (ANES) – Time series survey data on American national elections dating back to 1948.
Analyze Boston – City of Boston’s open data hub.
BARI Data Portal – Data products from the Boston Area Research Initiative (BARI) projects.
Comparative Constitutions Project – Text database of global government constitutions. See also: Constitute.
Correlates of War Project – Armed-conflict and related data on national militaries, disputes, alliances, and territorial change, among other datasets.
CountLove – Dataset of protests in the United States since 2017 broken down by state.
EM-DAT: The International Disaster Database – Data on occurrence and effects of over 22,000 disasters in the world from 1900 to today assembled from various sources including the UN, NGOs, the media, research institutions, and private industry.
Freedom House – Dataset of annual global reports on political rights and civil liberties.
Global Peace Index – Dataset on peace ratings per country based on a variety of indicators including crime, militarism, the arms industry, and conflict.
Global Terrorism Database (GTD) – Information on more than 190,000 terrorist attacks worldwide including date, location, weapons, target, casualties, and identifiable parties responsible.
Harvard’s Caselaw Access Project – Text data on written American caselaw broken down by state and federal jurisdictions.
HUD User Datasets – Datasets compiled by the US Office of Policy Development and Research.
LegiScan – Database of US State (and Washington DC) legislation including bill status.
Metropolitan Area Planning Council (MAPC) Data Common – A single location to explore and download MAPC’s datasets.
Quality of Government – Dataset of over 2,000 variables on global government quality in policy areas like health, the environment, social policy, and poverty.
State of the Cities Data Systems (SOCDS) – Data for individual Metropolitan Areas, Central Cities, and Suburbs.
Stockholm International Peace Research Institute (SIPRI) – Datasets on the arms-trade industry and military expenditures.
Systemic Peace’s global polity datasets – Annual, cross-national dataset of “patterns of authority” regime characteristics of global governments since 1800. Other datasets are also available.
UNESCO Institute for Statistics (UIS) – Data for the Sustainable Development Goals across countries.
UN Human Development Index – United Nations dataset on country-level measures of health, education, and economics.
Uppsala Conflict Data Program (UCDP) – Datasets on armed conflict that cover individual events of organized violence geocoded down to the level of individual villages, with temporal duration down to single days.
US Congress Bill Status XML Bulk Data – Data on the status of every bill in the United States Congress starting from the 113th Congress (2013-2015). See also the GitHub repository.
US Government Open Data – Multiple open datasets collated by different US Government agencies and departments.
World Bank EdStats (Education Statistics) – A comprehensive data and analysis source for key topics in education.
World Bank open data – Multiple open datasets of World Bank statistics.
World Values Survey – Survey of over 100 countries using a common questionnaire including time-series data on topics like economic development, democratization, religion, gender equality, and social capital.

TEI/XML

Documenting the American South – A collection of texts, images, and audio files related to southern history, literature, and culture.
Early English Books Online–Text Creation Partnership (EEBO-TCP): Navigations – Nearly 1,500 texts related to the themes of travel and navigation in the early modern world (also available as plain text).
Eighteenth-Century Collections Online–Text Creation Partnership (ECCO-TCP) – Searchable SGML/XML-encoded texts from among the 150,000 titles available in Gale’s Eighteenth Century Collections Online (also available as plain text).
Women Writers Online – A full-text collection of early women’s writing in English, including full transcriptions of texts published between 1526 and 1850, focusing on materials that are rare or inaccessible.

Curated Resources

Data Mining

Text and Data Mining – A subject guide by Amanda Rust on “Text and Data Mining Library Databases” from Northeastern University Library.

Communalytic – A social media data capture tool that allows users to store API keys and download data from a variety of social media websites.

Wikimedia Data Tutorial: Using public data from Wikipedia and its sister projects for academic research – This publicly available tutorial helps individuals understand the technical concepts around working with wiki data, potential pitfalls, and best practices and recommendations from researchers who have been working to increase access to this body of information.

Scraping Reddit the Right Way: A Guide to Legal and Ethical Data Collection with RedditHarbor – This guide introduces learners to ethical and legal methods for collecting Reddit data.

Natural Language Processing

Collection of NLP Datasets – GitHub repository by Nicolas Iderhoff of free and public domain datasets with text data for use in Natural Language Processing. Primarily raw, unstructured texts.

Resources from Professor Laura Nelson’s “Analyzing Complex Digitized Data”

Demonstration Corpora, by Alan Liu, including:

Abraham Lincoln Speeches and Letters – Corpus assembled by Alan Liu (see website for .zip file download link and metadata).
Adult British Fiction – Literature from the 1880s, sorted by author gender (see website for .zip file download link and metadata).
American Presidency Project – U.S. Presidents’ Inaugural Speeches, States of the Union, Campaign Platforms, and other presidential text material.
Book Summaries and Film Summaries from Wikipedia – Demo text assembled by David Bamman of the UC Berkeley School of Information (see website for .zip file download link and metadata).
Children’s Fiction – Children’s literature from the 1880s, sorted by author gender (see website for .zip file download link and metadata).
Feeding America – Michigan State University Library’s collection of historical American cookbooks spanning the late 18th century to early 20th century.
Grange Visitor – Michigan State University Library’s collection of The Grange Visitor, the official newspaper of the Michigan State Grange, published between 1875 and 1896.
Sunday School Books in 19th Century America – Michigan State University Library’s collection of Sunday school books published between 1809 and 1887.
U.S. Patents Related to the Humanities – Patents mentioning ‘humanities’ or ‘liberal arts’ between 1976 and 2015, located through the U.S. Patent Office (see website for .zip file download link and metadata).
Writings of William Wordsworth – Writings assembled by Alan Liu (see website for .zip file download link and metadata).

List of sites containing full-text books:

- Internet Archive Books – Includes plain-text access to books, issues of magazines, etc.
- Oxford Text Archive – A large number of texts available in a variety of forms, including plain text; texts are accessed one at a time.

Springboard List of Free Datasets for Data Science

Bureau of Economic Analysis – National and regional economic data including GDP and exchange rates.
Bureau of Labor Statistics – Important economic indicators for the United States, including unemployment and inflation, that can be segmented temporally or spatially.
CDC Cause of Death – Database of causes of death provided by the Centers for Disease Control and Prevention.
Data is Plural – A weekly newsletter of “useful/curious” datasets. This Google Doc lists the name and location of every dataset featured in the newsletter, covering everything from global foreign aid to Donkey Kong scores.
Dow Jones Weekly Returns – Stock price weekly returns from the Dow Jones Industrial Average.
Enron Emails – Text data of emails from the fraudulent energy company Enron.
FBI Crime Data – Time series crime data reported by the FBI at national and jurisdictional levels.
IMF Data – International financial data from the International Monetary Fund.
Medicare Hospital Quality – Database on hospital quality of care for hospitals across the United States.
SEER Cancer Incidence – Cancer data that can be sorted by gender, race, year, and other demographics.
United States Census Data – Statistics provided by the United States Census Bureau.

Corpora from Miriam Posner’s Crowdsourced Document:

Australian Hansard – Database of Australian Parliamentary debates, 1901-1980.
BitCurator – Effort to develop tools to analyze features of digital texts.
BNC-Baby – A 4-million-word subcorpus of the 100-million-word British National Corpus, with part-of-speech tagging in XML.
BYU Corpora – Widely used corpora of American English.
Canadian Hansard – Database of debates & journals of the Canadian Senate & House of Commons.
Christian Classics Ethereal Library – Database of classic Christian texts.
Chronicling America – Database of 12.8 million pages of American newspapers.
Europe PMC – Repository of life sciences books, articles, and preprints.
Europeana Collections – Repository of many datasets from European libraries & archives, from papyri to photographs to newspapers.
Foreign Records of the US – Database of the nearly complete run of Foreign Relations of the United States; see these tools to obtain full text.
HathiTrust – Database of 16 million volumes, mostly in English.
Internet Archive – Collection of websites, texts, audio, and other media, available for bulk download via wget.
Media History Digital Library – Database of nearly 2 million pages of media-related books and articles, 1875-1995.
Movie Quotes Corpus – Database of 220,579 conversational exchanges between 10,292 pairs of movie characters.
NYT Annotated Corpus – Database of 1.8 million New York Times articles and New York Times-supplied metadata.
Old Bailey Online – Collection of 197,745 London criminal trials, 1674-1913.
Open Islamicate Texts Initiative – 10,000 premodern Islamicate texts. See also repositories.
Perseus Digital Library – Large collection of classical texts, much of it encoded in TEI/XML.
ToposText – Database of 557 classical texts linked with a gazetteer of the ancient world.
Transkribus Corpus and READ – Efforts to use computer vision to recognize handwriting.
Trove Australia – Database of 565 million documents collected by the National Library of Australia, including a sizable collection of newspapers.
Twitter Datasets – A catalog of Twitter datasets that are publicly available on the web.
UCLA Broadcast NewsScape – Database of 170K hours of captioned news programs; see Red Hen Lab for information on access.
UK Hansard – Database of UK Parliamentary debates.
Wright American Fiction – Database of American adult fiction, 1774–1900.

DH Toychest

A collection of “demo corpora” ready to use for text analysis, along with some related tools, at the “DH Toychest” site curated by Alan Liu.

Responsible Data

Visit the “Responsible Datasets in Context” site to explore a variety of resources on the social and historical context of data that is essential for all responsible data work.

Library Text Mining Resources

The NU Library has a set of “Text and Data Mining” resources that includes a list of “Openly Available Text Sources.”