As a NULab Research Fellow, I worked on a supervised digital humanities project as part of my duties with the NULab and the Digital Integration Teaching Initiative. To this end, I completed a project using the digital humanities method of topic modeling to analyze headline news coverage about the anti-Vietnam War movement between 1964 and 1966. In February 1965, President Lyndon Johnson escalated the United States military presence in Vietnam and called up 50,000 draftees. This escalation quickly galvanized the peace movement because the Vietnam War would consequently have a more direct impact on young people’s lives. Although students traditionally received “2-S” deferments from the draft, President Johnson would revise this policy in 1965 to deny deferments to students below a certain class rank and score on an aptitude test administered by the Selective Service. Even students who kept their deferments had friends and family affected by Johnson’s military escalation. Both young and old had deep political or philosophical commitments that guided their activism even if their personal lives were not directly upended by the war.
Headline news shaped how a reading public understood the anti-Vietnam War movement at this time. Unlike editorials or advertisements, which have a more obvious angle they are attempting to advance, headline news is ostensibly an impartial recounting of past events. But headline news is not neutral; journalists have to make choices about the selection, arrangement, and prioritization of information, and all of these choices can be studied through the context of the “topics” that appear in the headline news from this period. Similarly, my own decisions about the data that went into my corpus were not objective. One rewarding aspect of pursuing digital humanities has been learning to detect biases in other people’s data (a reflective process) while understanding my own biases when creating data (a reflexive process).
Topic modeling, a machine learning method that has been adopted by humanists, can improve our close readings by computationally generating patterns in words from large amounts of text. I used MALLET (Machine Learning for Language Toolkit), a program developed at UMass Amherst, to generate word associations in a collection (corpus) of 118 text files from PDFs of newspaper articles about my subject. I downloaded the first 100 results that appeared in a search for “Vietnam War” and “protest” between 1964 and 1966 on the ProQuest Historical Newspapers database. Then I collected 18 more articles from an interactive timeline by the Mapping American Social Movements Project; I was able to find all of these through ProQuest. I kept to a limited number of files in order to complete a manageable project, but in future phases of the project I will expand the corpus beyond 1964-1966 while also including more articles from within that period.
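As a sketch of the corpus-preparation step (the directory path and file layout here are hypothetical, not my actual folder structure), the plain-text files extracted from the PDFs can be read into memory like this:

```python
from pathlib import Path

def load_corpus(corpus_dir):
    """Read every .txt file in corpus_dir into (filename, text) pairs."""
    docs = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        # path.stem is the filename without the .txt extension
        docs.append((path.stem, path.read_text(encoding="utf-8", errors="ignore")))
    return docs
```

Sorting by filename means that files named by date (as mine were) come back in chronological order.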
MALLET runs from the command line (the ultimate “back-end”) on your computer, and it produces results pretty quickly. I learned how to use it by following the steps of this tutorial by The Programming Historian. But there is also a Topic Modeling Tool that runs MALLET through a Graphical User Interface (GUI), where you only need to change some settings and click a few icons. I originally ran MALLET through the command line to get more familiar with the back-end, but eventually used the Topic Modeling Tool because it generated clean CSV data that helped me create some cool visualizations.
My results were very exciting, despite this project being largely exploratory in nature. I trained the topic model to generate 20 topics. After MALLET generated my topics, I had to come up with unique topic labels. The program does not know what the word clusters mean, only that the words appear together frequently, so I had to use close reading to categorize each topic according to the patterns that MALLET identified. The full list of topics, and the labels I ascribed to them, can be found on my personal website where the project lives.

The optimal or “natural” number of topics is an important question in the topic modeling community. The Programming Historian suggests that you train MALLET to generate twenty topics, but also that you may need to “cycle through a number of iterations” with fewer or greater numbers of topics. If you only train the topic model to generate twenty topics, and the majority of your texts are represented in all twenty topics, you may need to increase the number of topics to discover any true variations in the corpus. Personally, I chose twenty topics because when I had MALLET generate just ten, it was harder for me to ascribe unique topic labels. For instance, after MALLET generated 20 topics, “topic 0” (the numbers are identifiers only and do not rank topics by frequency) had the following words: “pickets napalm mrs dow picket fighting workers chemical library ceremony picketing board exchange company orange gruening unit wing plastic ion.” This is very clearly a “picketing” topic, with some interesting context from the pickets against Dow Chemical for manufacturing the napalm that the United States military used to harm civilians in Vietnam.
But when I had MALLET generate 10 topics from the same corpus, the words in topic 0 were: “morrison fast death artists tower art pentagon stern church portland poets bly di children wife man suvero quaker defense suicide.” I would not know whether to label this a “Protest Art” topic or a “Suicide/Immolation” topic (the latter because of “quaker” and “suicide”).
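The labeling step amounts to reading off each topic’s highest-weight words. Assuming a fitted scikit-learn LDA model and vectorizer (hypothetical stand-ins here for MALLET’s topic-keys output), those word lists can be extracted like this:

```python
import numpy as np

def top_words(lda, vectorizer, n_words=10):
    """Return each topic's n_words highest-weight words, one list per topic."""
    vocab = np.array(vectorizer.get_feature_names_out())
    topics = []
    for weights in lda.components_:  # one row of per-word weights per topic
        order = np.argsort(weights)[::-1][:n_words]  # heaviest words first
        topics.append(vocab[order].tolist())
    return topics
```

Each returned list is the kind of word cluster (“pickets napalm mrs dow…”) that a human reader then has to interpret and label.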
In this post I will focus on the visualizations I produced in Excel from the results of my topic model. With CSV data that indicated how frequently each topic appeared in each text file of my corpus, I created line graphs that represented headline news about the anti-Vietnam War movement between March 1964 (the earliest month/year in my corpus) and December 1965. Out of my 118 text files, 44 belonged to this timeframe. I narrowed the scope of my visualizations in order to focus more clearly on month-to-month changes in reporting that may be visible in a newspaper-based corpus.
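As a sketch of that aggregation step, assuming the Topic Modeling Tool’s CSV has one row per document with a date column and one column per topic (the column names below are hypothetical), pandas can average a topic’s proportion across each month:

```python
import pandas as pd

def monthly_topic_series(csv_path, topic_col, date_col="date"):
    """Average one topic's proportion across all documents in each month."""
    df = pd.read_csv(csv_path, parse_dates=[date_col])
    # "MS" resamples to month-start frequency, one value per calendar month
    return df.set_index(date_col)[topic_col].resample("MS").mean()
```

Calling .plot() on the returned series draws the month-to-month line graph directly, which is essentially what I built in Excel.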
Even with a smaller sample of text files, I was able to conclude that topic modeling can be a useful method for visualizing change over time according to “topics over time.” I borrow this phrase from Ben Schmidt, who argues that the “camel-backed curves” of a visualization, familiar to users of the Google Ngram tool, can unfairly penalize historical changes that are “discontinuous” and “cyclical.” Relatedly, Ted Underwood argues that visualizations based upon topic models can “warp time” by misleading us into believing that some changes were faster or slower simply by how the peaks and valleys appear. I went into the project aware of these criticisms of topic modeling, but I’ve concluded from my results that topic modeling holds a lot of promise for evaluating my corpus and guiding thoughtful conclusions.
The clearest interpretation I could make from the topic model was based upon the topic I labeled “Students and the Draft.” The top words in this topic were: “students draft university college student test school sit high faculty colleges society chicago selective tests deferment class building began stu” (the last word was cut off in the results, presumably a truncated “student” – an example of the occasional incoherence in MALLET’s raw output). After visualizing the frequency of the topic in my corpus on a month-to-month basis (each text file was named by its date), I could more clearly understand how responsive the anti-war movement was to President Johnson’s announcement to call up 50,000 draftees in February 1965. There is a very clear increase in headline news coverage of student activism against the draft after February 1965. Individual images of each visualization can be examined more closely here.
From the visualizations and topic outputs that I interpreted, I was able to stake out a position in methodological debates about topic modeling and further my engagement with digital humanities as a key aspect of my historical training. In a future iteration of this project, I will be expanding my corpus to include headline news coverage of the anti-Vietnam War movement up to the war’s end in 1975 and representing the results with more advanced visualization tools.