Big Data Simpsons
Thanks to the work of Benjamin M. Schmidt, an assistant professor of history at Northeastern University, 25 years of dialogue from The Simpsons have been smashed into a giant data set, connected to a user-friendly search window.
The results appear as a graph showing how frequently the searched words were used in each of the show’s seasons. The web page, titled Bookworm: Simpsons, even lets users click any point on the graph to pull up the actual lines of dialogue containing their search word.
So what do we learn? Well, for one thing, the word “doughnut” appeared less and less frequently as the show progressed, though this appears to be partially explained by the fact that someone started spelling it “donut.”
And while Bart Simpson brought new popularity to the phrase “Don’t have a cow,” the show’s actual usage of the word “cow” peaked in 2005, whereas its use of Bart’s catch-phrase “caramba” continues to climb.
Though surprisingly, the use of the word “beer” has been dropping fairly steadily over the entire run of the series — moving almost in sync with the show’s use of the word “Duff,” the fictitious brand of beer favored by Homer Simpson.
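For readers curious what sits behind a chart like this, the basic count is easy to sketch. The Python below is a minimal illustration, not Bookworm’s actual code: the sample lines are invented, the function name `words_per_million` is hypothetical, and scaling by each season’s total word count is an assumption, since the article does not say how the chart is normalized. It also shows why the “doughnut” versus “donut” wrinkle matters, because counting both spellings together changes the trend.

```python
import re
from collections import Counter

# Hypothetical sample data: (season, line of dialogue) pairs.
# These lines are invented for illustration, not real subtitle text.
LINES = [
    (1, "Mmm, doughnut."),
    (1, "Who took the last doughnut?"),
    (2, "One donut, please."),
    (2, "No beer and no doughnut."),
]


def words_per_million(lines, variants):
    """Return {season: occurrences of any variant per million words spoken}."""
    wanted = {v.lower() for v in variants}
    hits = Counter()    # matches per season
    totals = Counter()  # all words per season, used to normalize
    for season, text in lines:
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[season] += len(tokens)
        hits[season] += sum(1 for token in tokens if token in wanted)
    return {s: 1_000_000 * hits[s] / totals[s] for s in sorted(totals)}


# Counting one spelling alone makes the word look like it is fading...
one_spelling = words_per_million(LINES, ["doughnut"])
# ...while counting both spellings together tells a different story.
both_spellings = words_per_million(LINES, ["doughnut", "donut"])

for season in one_spelling:
    print(season, round(one_spelling[season]), round(both_spellings[season]))
```

Dividing by the season’s total word count keeps a season with more episodes from looking like a spike in usage; a raw count would work too, but it would be harder to compare across seasons.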
There’s some serious science behind all this. Schmidt is part of the core faculty at Northeastern University’s NuLab for Texts, Maps and Networks (a center for “Digital Humanities and Computational Social Science”). He also teaches classes on U.S. history, digital history, and the history of big data (for both undergraduates and graduate students).
Schmidt describes himself as a digital humanist, adding “much of my work explores the way historians (and anyone else who wants to tell a story) can use massive digital archives to communicate in new and old ways through data analysis, visualization, and algorithmic transformations.”
Mixing Interactive Datasets with Crowdsourcing
And analyzing word frequency on The Simpsons is only his latest side project.
Schmidt has also examined the “State of the Union” addresses of America’s presidents, creating various interactive charts showing how often key words were used by each president. For 130 years the word “majesty” appeared in the addresses of every president except Thomas Jefferson, Zachary Taylor, and Benjamin Harrison — up through Warren G. Harding’s address in 1921, after which it disappeared for the next 44 years. And William Taft’s address in 1909 was the last one to contain the word “Indians.”
He’s studied everything from college majors to baseball statistics. But more importantly, he’s also one of the co-creators (and co-directors) of a tool called Bookworm, which creates data visualizations from repositories of digital texts — for example, books, newspapers, or scientific publications — with hosting supported by the Open Science Data Cloud. And Bookworm can also access a searchable database of dialogue from movies and TV shows (using subtitles from the popular site OpenSubtitles.org) — which is what ultimately powered the Bookworm: Simpsons page.
Schmidt describes it as “a particularly trivial example of how a Bookworm Browser can open up your texts.” In a larger sense, it’s a striking example of “culturomics,” which has been described as “the application of high-throughput data collection and analysis to the study of human culture.”
For example, Harvard’s Cultural Observatory says it’s working to “enable the quantitative study of human culture across societies and across centuries” by pursuing a three-pronged approach:
- Creating massive datasets relevant to human culture.
- Using these datasets to power wholly new types of analysis.
- Developing tools that enable researchers and the general public to query the data.
So we may be seeing more analyses like these in the years to come.
Schmidt’s latest project also drew reactions from non-academics on Hacker News. “I find it interesting how ‘Homer’ has experienced a constant, steady decline in usage over time,” wrote one commenter, prompting another to do an even more detailed experiment. It’s a telling example of what happens when an interactive dataset meets crowdsourced experimentation.
Feature image: Bartman sculpture, by Nancy Cartwright, who is the voice of Bart Simpson. NYC.