Addressing the Black Boxes of Machine Learning

My time at NULab coincided with my efforts to write my dissertation’s methods section, and I was fortunate to be able to pursue a fellowship project that was meaningful to both at once. My research has incorporated a specific machine learning technique that enables large quantities of text to be modeled in vector space. Essentially, this means that each word in the text is assigned a fixed address in a multidimensional space, and the numbers that make up that address can be used to measure the word’s semantic similarity to other words.
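To make the idea of an "address" concrete, here is a minimal sketch in Python. It is not the workflow used in my dissertation (which relied on R), and the three-dimensional vectors are invented for illustration; a trained model would typically have many more dimensions and learn its values from a corpus. The sketch assumes cosine similarity as the measure of closeness, which is one common choice, and the function names are hypothetical.

```python
import math

# Toy 3-dimensional "addresses". These numbers are invented for
# illustration only and do not come from any trained model.
toy_vectors = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "banana": [0.10, 0.20, 0.95],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors):
    """Rank every other word by its cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in most_similar("king", toy_vectors):
        print(f"king vs. {other}: {score:.3f}")
```

Words whose vectors point in nearly the same direction score close to 1, while unrelated words score lower. In this toy example, "queen" ranks above "banana" as a neighbor of "king" only because the numbers were chosen that way.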

This technique, known as word vectors or word embedding models (WEMs), provides researchers with a unique way to investigate various components of semantic similarity by “embedding” words in multidimensional models. My research, dealing with the political and nationalistic manifestations of American far-right ideologies, benefitted tremendously from the ability to impose a mapping schema on massive quantities of text written by far-right individuals.

My introduction to text-as-data and Natural Language Processing (NLP) came from the Women Writers Project’s Word Vectors for the Thoughtful Humanist seminar in May 2022. The materials and discussions from the seminar went a long way toward explaining a very opaque concept to me, and they gave me everything I needed to get my research off the ground. However, when I went into the literature to augment my understanding of how vector space actually worked and what the techniques I was using actually did, I realized that the sense of opacity was still great, almost impenetrable.

It struck me that researchers who were entering the realm of NLP for the first time, especially those coming from disciplines without strong backgrounds in digital tools, might be scared off by this initial opacity. And if researchers are hesitant to do more than scratch the surface of what WEMs can do, it becomes all the more difficult to conduct more than exploratory research with their models. WEMs can be used to investigate serious research claims, not just to surface interesting patterns in data, and making their underlying logics more accessible could help researchers use text-as-data techniques to support more nuanced arguments.

From this came the impetus for this fellowship project. This project attempts to address this endemic opacity by offering some basic explanations of key components of text-as-data techniques. In pursuing my dissertation research, I earmarked several components of WEMs that “made sense” when I was filling in R code, but that had a deeper conceptual basis that I didn’t quite understand. In this fellowship project, I went back into the literature to better understand those earmarked components, such as how corpus size and preprocessing choices made early in a project affect the resulting model.

This project provides what I hope is a useful preliminary template for researchers looking to understand how their WEMs work on a slightly deeper level. It can function as the early template for a baseline that large-scale projects such as the Women Writers Project could build on when explaining their techniques to novice users, especially those from disciplines outside the digital humanities.

I worked toward two interconnected aims in this project. The first was to provide some measure of simplification for researchers, placing theory in plainer language than is often used in the literature. The second was to provide a very preliminary literature review of how text-as-data is applied in practice. In deciding which literature to consult, I built the core of an annotated bibliography from six papers and one book from across the disciplines of the digital humanities and computational social science. With WEMs as an entry point, this project surveys each of the published works for various elements of their approach to machine learning generally and WEMs specifically. I intended, by laying these components out in plain language, to provide an easy-to-digest breakdown of some of the fundamental components that must be considered before a researcher can begin their text-as-data project.

One interesting unintended consequence of this deep dive into the literature of computational social science and digital humanities with WEMs in mind was that I stumbled upon the roots of a disciplinary divergence. Conventional wisdom (as it were) in the text-as-data world holds that, when you’re building corpora to feed your models, the more words you have, the more representative the models can be. One of the surveyed works from the social sciences challenged this conventional wisdom about corpus size, pioneering a new technique that yields replicable results.

The potential inherent in this technique (and future similar developments) could provide a huge opportunity for researchers like me. While my research contains a substantial amount of text by human standards, it isn’t large by machine learning or NLP standards. The literature argues that in some (but by no means all) instances, social scientists might deal with smaller corpora, meaning that any methodology that caters to our particular needs is very welcome. For instance, a corpus of court briefs dealing with a specific case or series of cases may necessarily be of limited size. As machine learning techniques grow in popularity, it is very likely that new techniques will be able to expand on the conventional wisdom in the digital humanities and computational social science scholarship, thus guiding the way for scholars in other disciplines to better match the techniques to their own workflows and target corpora.

In this project, I sought to begin a clarification process that is necessarily incomplete. As methods evolve, our work to explain them to new researchers must also evolve. This project has begun this process, providing both my own dissertation research and hopefully future researchers with the groundwork for understanding what goes on in the black boxes of WEMs a little better. My investigation into the processes of WEMs has given me a greater understanding of not only the methods that I used in my dissertation, but also the potential of such methods to be used in social science research going forward.

The preliminary annotated bibliography, up to date as of April 2023, can be found here.
