Duke Machine Learning Seminar Series

An Improved Method of Automated Nonparametric Content Analysis for Social Science

Speaker: Gary King
Date: Wednesday, March 1, 2017
Time: 3:30pm - 5:00pm
Location: Ahmadieh Grand Hall (Gross Hall 330), Duke


A vast literature in computer science and statistics develops methods to automatically classify textual documents into chosen categories. In contrast, social scientists are often more interested in aggregate generalizations about populations of documents, such as the percent of social media posts that speak favorably of a candidate's foreign policy. Unfortunately, trying to maximize the proportion of individual documents correctly classified often yields biased estimates of statistical aggregates. Fortunately, classification is neither a necessary nor always a desirable step in estimating aggregate proportions, as in the widely used nonparametric method developed in King and Lu (2008) and Hopkins and King (2010). In this paper, we first prove the properties of this methodology, develop ways around its weaknesses, and show how to improve its estimates in real applications. We then develop a unified approach to inference about statistical aggregates that uses this approach, along with the best classifiers for extrapolations when language changes over time, to produce better estimates than either method can accomplish alone. We evaluate our approach with analyses of 74 separate data sets. This talk is based on joint work with Connor Jerzak and Anton Strezhnev.
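The point that accurate individual classification can still bias an aggregate proportion can be illustrated with a small arithmetic sketch. It is not the method presented in the talk: the function names and the error rates (a true positive rate of 0.8 and a false positive rate of 0.1) are hypothetical, and the correction shown is the simple textbook inversion of the classifier's known error rates, included only to show that the category share can be estimated at the aggregate level.

```python
def classify_and_count(true_share, tpr, fpr):
    """Expected share of documents labeled positive by an imperfect classifier."""
    return true_share * tpr + (1.0 - true_share) * fpr


def corrected_share(observed_share, tpr, fpr):
    """Invert the relation above to recover the true share (requires tpr != fpr)."""
    return (observed_share - fpr) / (tpr - fpr)


# Hypothetical error rates, chosen only for illustration.
tpr, fpr = 0.8, 0.1

for true_share in (0.2, 0.6):
    observed = classify_and_count(true_share, tpr, fpr)
    recovered = corrected_share(observed, tpr, fpr)
    print(
        f"true={true_share:.2f} "
        f"classify-and-count={observed:.2f} "
        f"corrected={recovered:.2f}"
    )
```

With these assumed rates, a population in which 20% of documents are favorable is reported as about 24% favorable by counting classifier labels, even though 88% of the individual documents are classified correctly. The size and direction of the error also change with the true share, which is why counting labels is unreliable when the population shifts.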


Gary King is the Albert J. Weatherhead III University Professor at Harvard University -- one of 24 with Harvard's most distinguished faculty title -- and Director of the Institute for Quantitative Social Science. King develops and applies empirical methods in many areas of social science research, focusing on innovations that span the range from statistical theory to practical application. King received a B.A. from SUNY New Paltz (1980) and a Ph.D. from the University of Wisconsin-Madison (1984). His research has been supported by the National Science Foundation, the Centers for Disease Control and Prevention, the World Health Organization, the National Institute on Aging, the Global Forum for Health Research, and centers, corporations, foundations, and other federal agencies.