Text Analytics

With so many desiccated combustibles cluttering its forest floor, a controlled burn of the humanities is overdue. Text analytics exploiting corpora of big data is destined to serve as this fire's helitorch. Text analytics represents a way of approaching cross-disciplinary research questions with a scientific or data-driven method. In some erotetic environments, the application of text analytics leads literally to use of the scientific method: constructing a hypothesis, testing with an experiment, analyzing the resulting data, and on to further efforts at disconfirmation. In other environments, the research process is less supervised and more exploratory, but is no less driven by data. 

Up in the ivory tower, some humanists will take a hard fall out of the armchair. If that is due to endemic social, cognitive, and disciplinary biases, so much the worse for those research programs. Text analytics will soon overturn a number of pet interpretations of classic texts. In fact, it already seems to have done so. For example, countless sinologists believe that the Early Chinese mind is 'holistic' and not 'dualist,' dualism being a Western category inapplicable to East Asia. But interdisciplinary researchers like Ted Slingerland and Maciek Chudek combined text analytics with human coders [http://dx.doi.org/10.1111/j.1551-6709.2011.01186.x] to gather data indicating that the standard humanities interpretation of the Chinese mind is erroneous. 

In an attempt to better understand the methods and limitations of text analytics research, my work in this area has ranged over a few different issues and languages. Justin Lynn, Ben Purzycki, and I converted some interpretations of the science fiction genre by literary scholars into testable hypotheses. We built a database of texts (controlling for year of publication, author gender, etc.) across science fiction, fantasy, and mystery, then processed those texts with Linguistic Inquiry and Word Count, Jamie Pennebaker's brilliant software tool. 
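As a rough illustration of the dictionary-based counting that a tool like Linguistic Inquiry and Word Count performs, the sketch below reports the share of a text's words that fall into each category of a word list. It is not our pipeline: the function name category_rates and the two-category TOY_LEXICON are hypothetical stand-ins, and the real LIWC dictionaries are proprietary and far larger.

```python
import re
from collections import Counter

# Hypothetical toy lexicon for illustration only; not the LIWC dictionary.
TOY_LEXICON = {
    "cognition": {"think", "know", "believe"},
    "emotion": {"fear", "love", "hate"},
}


def category_rates(text, lexicon=TOY_LEXICON):
    """Return, for each category, the fraction of tokens that belong to it."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    # Guard against empty input so the division is always defined.
    return {
        category: (counts[category] / total if total else 0.0)
        for category in lexicon
    }


if __name__ == "__main__":
    print(category_rates("I think I know fear"))
```

Comparing such rates across genres, with publication year and author gender held constant, is one way an interpretive claim about a genre can be turned into a testable hypothesis.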

Kristoffer Nielbo and I, along with Carson Logan, Uffe Bergeton, and Ted Slingerland, are applying several resources toward answering research questions about a large database of historical Chinese texts. These questions include: Does the representation of the self across millions of characters of historical Chinese literature carry markers consistent with results in contemporary cross-cultural psychology, which show that East Asian selves are more integrated with immediate family and communities than are other selves? Does historical China have the sort of supernatural agencies that the work of Ara Norenzayan and Azim Shariff would recognize as the high gods of the cognitive science of religion, namely gods who are concerned with human morality, who have the power to punish, and who are able to monitor human behavior? We are at work answering these questions using topic modeling methods to identify and track through time latent associations between target characters. We have also developed another method that draws from corpus linguistics, which we call association modeling. Association modeling is allowing us to calculate and compare pinpoint associations between not only target terms but also specific uses of target terms.
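To give a sense of what measuring the association between two target terms can look like, here is a minimal sketch using pointwise mutual information (PMI), a standard corpus-linguistics statistic. This is an assumption chosen for illustration and is not a description of association modeling itself; the function name pmi and the idea of treating each passage as a co-occurrence window are hypothetical choices.

```python
import math


def pmi(passages, term_a, term_b):
    """Pointwise mutual information of two terms, in bits.

    Each passage is a string treated as one co-occurrence window, and a term
    counts as present if it occurs anywhere in the passage as a substring.
    """
    n = len(passages)
    count_a = sum(1 for p in passages if term_a in p)
    count_b = sum(1 for p in passages if term_b in p)
    count_ab = sum(1 for p in passages if term_a in p and term_b in p)
    # PMI is undefined when either term or the pair never occurs;
    # negative infinity marks that case here.
    if n == 0 or count_a == 0 or count_b == 0 or count_ab == 0:
        return float("-inf")
    # log2( P(a, b) / (P(a) * P(b)) ), with probabilities estimated from counts.
    return math.log2((count_ab * n) / (count_a * count_b))
```

A positive score means the two terms share passages more often than chance would predict, a negative score means less often. Because substring matching works on single characters as well as words, the same function applies to unsegmented Chinese text.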