As of mid-October, we’ve gathered nearly 600,000 messages mentioning Congressional candidates across the United States. We’d like to survey what users are talking about at any given moment, but reading hundreds of thousands of messages really isn’t possible. So we need some method of categorizing messages by the topics people are discussing. This could help us discover which topics were most popular and when and where.
But how would we discover a topic? If we consider the vocabulary — every unique word used in every message we’ve gathered — then we could think of a topic as a subset of words from that vocabulary. Messages that contain words like “poll”, “candidate”, and “leading” likely discuss topics related to who’s up or down in a given Congressional race. On the other hand, messages with words like “scandal”, “news”, and “denied” probably refer to some embarrassing situation that a candidate has found themselves in.
Of course, we’d like the ability to assign some words to more than one topic. For instance, “voted” and “candidate” might characterize messages that discuss how a member of Congress voted in the last session, while “voted” and “yesterday” could refer to users discussing the early absentee ballot they filed.
Topic models are a class of statistical models that attempt to capture exactly this process. Rather than categorizing text simply by the presence of certain words, they do so on the basis of word distributions. Each word is assigned a probability of belonging to each of the topics found in a set of messages. The model then assigns a topic to each message based on the probabilities of all the words that message contains.
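To make the idea of scoring a message by its word probabilities a little more concrete, here is a minimal Python sketch. It is not the model we run: the two topics, the probabilities, and the function names are all made up for illustration, and a real topic model learns its topics and probabilities from the messages rather than having them written in by hand. The sketch only shows the last step, picking the topic under which a message's words are most probable.

```python
import math

# Hypothetical word probabilities for two hand-written topics.
# These numbers are invented for illustration; a real topic model
# would estimate them from the messages themselves.
TOPIC_WORD_PROBS = {
    "polls": {"poll": 0.30, "candidate": 0.25, "leading": 0.20, "voted": 0.05},
    "scandal": {"scandal": 0.30, "news": 0.20, "denied": 0.20, "candidate": 0.10},
}

# Small fallback probability for words a topic has no entry for (an assumption).
UNSEEN_PROB = 0.01


def score_message(message, topic_word_probs=TOPIC_WORD_PROBS):
    """Return a log-probability score for the message under each topic."""
    words = message.lower().split()
    scores = {}
    for topic, probs in topic_word_probs.items():
        # Summing logs is equivalent to multiplying the word probabilities,
        # but avoids numerical underflow on long messages.
        scores[topic] = sum(math.log(probs.get(word, UNSEEN_PROB)) for word in words)
    return scores


def most_likely_topic(message, topic_word_probs=TOPIC_WORD_PROBS):
    """Pick the topic with the highest score for this message."""
    scores = score_message(message, topic_word_probs)
    return max(scores, key=scores.get)


print(most_likely_topic("candidate leading in new poll"))
print(most_likely_topic("candidate denied scandal news"))
```

Note that “candidate” appears in both topics with different probabilities, which is how a single word can contribute to more than one topic.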
We run these models on our district-level data every night, looking each time at the last five days of messages that discuss either candidate in each district. For each district, we assign the “most likely” topic from all the topics we discover in the entire dataset. Thus, for instance, a district’s Twitter discussion might cover both “polls” and “scandal”; if the discussion of “polls” is much more common, we label that district with the “polls” topic.
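One simple way to picture the district-level step is as a tally: give each message in a district a topic, then label the district with whichever topic comes up most. The Python sketch below assumes exactly that. The district identifiers and the per-message topics are hypothetical, and the actual assignment we use may weigh messages differently.

```python
from collections import Counter


def dominant_topic(message_topics):
    """Return the most common topic in a list of per-message topics.

    Returns None for a district with no messages in the window.
    """
    if not message_topics:
        return None
    counts = Counter(message_topics)
    topic, _count = counts.most_common(1)[0]
    return topic


# Hypothetical districts, each with the topics assigned to its recent messages.
district_message_topics = {
    "XX-01": ["polls", "scandal", "polls", "polls"],
    "XX-02": ["scandal", "scandal", "polls"],
    "XX-03": [],
}

district_labels = {
    district: dominant_topic(topics)
    for district, topics in district_message_topics.items()
}

print(district_labels)
```

Here the first district discusses both “polls” and “scandal”, but “polls” wins the tally, so that is the label it would receive.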
We usually discover between 30 and 40 different topics under discussion across all the Congressional districts we track. Some topics show up in only one or two districts, reflecting local issues or politics. Others show up in many districts, reflecting common concerns or national issues. Since we can’t easily display all 40 topics at once, the figure at the right looks only at the topics that appear most frequently across all the districts.
As you can see, one topic — polling — dominates the discussion in most districts (mapped in tan). That’s not surprising. But we can learn more than that. Grover Norquist’s anti-tax pledges (mapped in purple) dominate discussion in districts across the country. Healthcare and statements by Senate Majority Leader Harry Reid also appear in many different states (mapped in dark green). In contrast, interest in the ongoing scandal surrounding Secret Service agents and prostitution appears largely in the Southeast and Midwest (mapped in light green). Some districts in the Midwest picked up the controversy over Sensata, an Illinois firm targeted for outsourcing by Presidential candidate Mitt Romney’s former company, Bain Capital. Finally, and unsurprisingly, a focus on Massachusetts politics and related bills, debates, and votes dominates mostly in the Northeast (mapped in orange).
Each Congressional race is distinct, of course. Different candidates, different local issues, and other features make them so. But these topic models allow us to discover commonality across the hundreds of districts we track. Specifically, Americans and political campaigns love polls, Twitter users really love polls, and we pay a lot of attention to them.