VOTING WITH YOUR TWEET: An experiment in political forecasting
What is Voting with your Tweet trying to do?
This is first and foremost an experiment in trying to see how well we can use social media to generate election predictions across the 2010 and 2012 election cycles. The FAQ below covers many of the aspects of how we are doing this, why it might work, and why it could fail.
How do you predict who will win in each district?
We mine Twitter’s data feed each night for mentions of Congressional candidates in the prior 24 hours. New mentions are added to the existing mentions of each candidate in our database. We then feed the message text into an algorithm that generates the prediction.
If you want more information see the FAQs about the data and algorithm below. If you want way more information, you can read the interim technical report here.
This seems dubious. Should I make decisions based on your predictions?
No. Voting with your Tweet is an experiment and should be treated as such. We think this might work and think we might know why. But we could fail spectacularly.
If I shouldn’t trust the predictions, then why are you doing this?
Lots of researchers have tried to predict elections after the fact. This is an attempt to build on that research and create a real-time prediction before the election. We think that publishing this live has several advantages:
- Anyone who is interested can observe the experiment.
- It keeps us honest: whether we succeed or fail, the predictions will be out there for all to see.
- Observers can offer constructive feedback.
- It gives us the opportunity to post our own observations.
- It offers the general public a window into the intersection of social data and political research.
Who else is doing this kind of thing?
Predicting elections is an emerging area of research. See these papers if you are interested:
- O’Connor, Balasubramanyan, Routledge, and Smith. From Tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of the International AAAI Conference on Weblogs and Social Media, pages 122–129, 2010.
- Tumasjan, A. et al. Election Forecasts With Twitter: How 140 Characters Reflect the Political Landscape. Social Science Computer Review. 2010.
- Tweetminster. Is word of mouth correlated to General Election results? The results are in. 2010.
Additionally, Twitter is running a Presidential election sentiment index.
Should I trust them?
Maybe, maybe not. The research is all very smart. But as Daniel Gayo-Avello of the University of Oviedo (Spain) has pointed out, most of these papers (and Mark’s technical report) were retrospective. They looked at elections after they’d already happened, rather than trying to predict elections before the fact. To test whether their predictors worked, they held back some of the election results from the data they used to train their predictors. They could then test the accuracy of the algorithm on this data, which the algorithm hadn’t seen before.
But because so much can change between elections, that’s not quite the same thing as trying to predict the outcomes of an entirely new election. For instance, an algorithm built on the 2010 United States Congressional Election might have found that tweets that mentioned both the candidate and “Obamacare” were good predictors of Republican victories. 2010 was a strong year for Republicans, and they ran against Democratic incumbents who’d supported President Obama’s health care reforms. But health care might play a totally different role in the 2012 race, so this predictor could be wrong.
Tweetminster is the big exception: they claim to have accurately forecast the British Parliamentary election beforehand. But exactly how they did so remains unclear.
Finally, there are many other unknowns: political campaigns are full of spam and distraction; and political speech is full of sarcasm, irony, and satire. Detecting real sentiment or intent from all this noise remains a difficult problem.
We also don’t have a strong grasp of whether politically-active Twitter users actually vote. We know, for instance, that Twitter use is very common among younger people, but that younger people are less likely to vote (if they can vote at all). Building what pollsters call a “likely voter model” for Twitter users remains an open problem.
How does your method differ from theirs?
These papers generally used one of two measures as a proxy for electoral success: either counting how many times a candidate was mentioned on Twitter during the campaign, or counting how many times Twitter users used “positive” or “negative” words to describe the candidates.
Our method takes a different tack. We used the results of the 2010 Congressional election to determine what terms were the best predictors of which party won in each Congressional district. We are now applying that algorithm to the 2012 election. For more information, see the algorithms section below.
Data and Sources
Where does the data come from?
We mine the data from Twitter using their open Search API. The Twitter Search API allows anyone to search the Twitter feed from the last week or so.
How do you get the data?
We query the Search API every night, for all mentions of each Congressional candidate in the past 24 hours. We download and store those tweets and related data about when they were created, whether they were retweets (RT) or mentions (MT), and other metadata.
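As a sketch, the nightly job amounts to building one search query per candidate and keeping only the tweets we haven’t already stored. The endpoint and parameter names below reflect Twitter’s v1 Search API as of this writing; treat them as illustrative rather than authoritative:

```python
from urllib.parse import urlencode

def search_url(candidate, page=1):
    """Builds one nightly search query for a candidate's mentions.

    NOTE: endpoint and parameter names are those of the 2012-era
    v1 Search API; they are shown for illustration only.
    """
    params = {"q": '"%s"' % candidate,  # exact-phrase match on the name
              "rpp": 100,               # results per page (API maximum)
              "page": page,
              "result_type": "recent"}
    return "http://search.twitter.com/search.json?" + urlencode(params)

def merge_new(existing_ids, fetched):
    """Keep only tweets not already stored from a prior night's run."""
    return [t for t in fetched if t["id"] not in existing_ids]
```

Each night we page through the results for every candidate, then merge only the unseen tweet IDs into the database.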
Can you share the data?
No. Twitter’s Terms of Service make clear that data obtained through the API is fine for private purposes, but shouldn’t be redistributed.
What about candidates with really generic names? Don’t you get a lot of noise when searching for Candidate Smith?
We’ve taken steps to make sure the data is as clean as possible. For instance, when we gathered data on the 2010 election, we found that one of the candidates shared the same name as the kicker for the New Orleans Saints football team, which gave us a lot of data about game outcomes.
This time around, we used Google searches to identify whether similar situations existed, and made sure to clean up the data feed along the way.
What about spam?
Spam has become a problem on Twitter lately. We also don’t want to overrate a candidate simply because their campaign sends out tons of tweets. We already know the campaign thinks their candidate is great, but that doesn’t tell us anything about what voters think. We don’t presently filter for spam explicitly, although we do try to filter out unrelated content. For instance, if a candidate shares a name with a sports figure, we might get lots of sports-related tweets in our data feed. We try to identify those and take them out before doing any further analysis.
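One simple way to drop that kind of unrelated content is a hand-curated blacklist of noise terms per candidate. A minimal sketch (the noise terms here are hypothetical; in practice they’re chosen after inspecting each candidate’s feed):

```python
# Hypothetical noise terms for a candidate who shares a name
# with a football player.
NOISE_TERMS = ("field goal", "touchdown", "kicker")

def looks_off_topic(text, noise_terms=NOISE_TERMS):
    """True if a tweet matches any hand-curated noise term."""
    lowered = text.lower()
    return any(term in lowered for term in noise_terms)
```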
Where did you get your map of congressional districts?
The map of the 113th Congress was compiled from finalized state maps once legal challenges had been resolved. Most states released shapefiles, but in a few cases the only available maps were PDFs. In those cases, the maps were rasterized and georeferenced, and the districts were drawn by hand.
Why do some districts have no prediction?
There are three reasons why we might not predict a specific election:
- The race is uncontested–that is, only one party is running a candidate for the election
- One of the two major candidates in the race is neither Republican nor Democrat. This includes districts where candidates are from the same party.
- We only have Twitter data for one of the two candidates
In practice, most of the races we don’t predict come from districts with very little Twitter traffic for one or both candidates. The others are largely races for which we couldn’t identify both a Republican and a Democratic candidate.
What about Louisiana?
Louisiana’s races are conspicuously absent. We do go looking for their candidates on Twitter, but as of early October we could not predict any Louisiana district. District 2 returned data for only one candidate. In the other districts, Republican candidates either faced no challengers or faced a challenger from a party other than the Democrats.
How does the algorithm really work?
The algorithm works by mapping the term frequencies of words that appear in tweets about Congressional candidates to either (1) which party will win the race (Democrats or Republicans); or (2) the vote share that the Democratic candidate will receive. This is a supervised machine learning approach that uses the results from the last election to train an algorithm for predicting results in the coming one.
How was the algorithm built up?
During the 2010 Congressional election, we gathered approximately 250,000 tweets about 356 Congressional races. Based on those messages and the election outcomes, we trained two machine learning algorithms to determine what word features of those tweets best predicted whether the Democratic or Republican party candidate won each race.
We did this with a series of steps:
- Count all of the unique bigrams (pairs of consecutive words) in each tweet
- Count the occurrence of all bigrams in each district for the entire race
- Weight the counts so that terms in tweets closer to the election were treated as more important than terms that occurred much earlier in the election
- Build up a matrix, using these counts, of all terms in tweets for both candidates in the race, for all districts. This gave us a matrix of D districts and T terms. Filtering for common words (like “the”, “is”, etc.) and uncommon words resulted in about 1000-2000 unique terms for 356 districts
- Use the SuperLearner algorithm to determine which terms were the best predictors (positive or negative) of whether a Democrat won the election
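The counting, weighting, and matrix-building steps above can be sketched roughly as follows. The exponential half-life decay is an illustrative choice, not our exact weighting scheme, and the stop-word and rare-term filtering is omitted for brevity:

```python
from collections import Counter

def bigrams(text):
    """Pairs of consecutive lowercase words, e.g. 'voted hcr'."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def district_term_matrix(districts, half_life_days=7.0):
    """districts: {name: [(days_before_election, tweet_text), ...]}.

    Returns (rows, terms): rows[name] is that district's weighted
    bigram-count vector, aligned with the sorted term list.
    """
    counts = {}
    for name, tweets in districts.items():
        c = Counter()
        for days_before, text in tweets:
            # Tweets closer to election day count for more.
            weight = 0.5 ** (days_before / half_life_days)
            for b in bigrams(text):
                c[b] += weight
        counts[name] = c
    terms = sorted({t for c in counts.values() for t in c})
    rows = {name: [counts[name][t] for t in terms] for name in districts}
    return rows, terms
```

The resulting rows, one per district, form the D-by-T matrix that the learning step trains on.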
Can you go into more detail?
Sure. This is what’s called a bag of words approach: take some text, chop it up into individual terms, count how often each term appears, and use those “term-frequency” counts as a numerical representation of the message.
In the process, of course, we throw out a ton of information: grammar, sentence structure, sentence context, and so forth. We also usually throw out super-common words like “is” or “won’t”. But in practice the process often works really well.
Using these term-frequency representations, we can build up “documents” for each congressional district with all the messages about either of the two candidates. Because we assume that messages that came out later in the campaign are probably better indicators of who might win than those that came out earlier, we weight them more heavily in the aggregate term counts.
We can now do real math on this matrix. The SuperLearner is what’s called an ensemble machine learning algorithm. It takes a bunch of different algorithms (like random forests, support vector machines, lasso regression, and others) and uses each to try to predict the elections on their own. It then weights the individual predictions and combines them together. In practice, this can result in a prediction that’s more accurate than the best single algorithm.
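Conceptually, the ensemble step reduces to a weighted average of the base learners’ predictions. In this toy sketch the base predictions and weights are made up; the real SuperLearner fits the weights by cross-validating the base learners:

```python
def combine(base_predictions, weights):
    """Weighted average of base-learner probabilities that the
    Democrat wins. SuperLearner fits these weights by cross-validation;
    here they are fixed for illustration."""
    return sum(p * w for p, w in zip(base_predictions, weights)) / sum(weights)

# Hypothetical outputs from three base learners for one district,
# e.g. a lasso regression, a random forest, and an SVM.
p_lasso, p_forest, p_svm = 0.70, 0.55, 0.65
district_prediction = combine([p_lasso, p_forest, p_svm], [0.5, 0.2, 0.3])
```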
Conceptually, how does the finished algorithm actually work?
We think the finished algorithm works like this:
First, it identifies from the language in a candidate’s tweets whether they are the incumbent or challenger. Since incumbents win about 85% of the time, this provides a good baseline.
It then adjusts the baseline prediction based on sentiment and action-related phrases. For instance, “voted hcr” (indicating that the incumbent voted for health care reform) was one of the most influential predictors alongside incumbency-related phrases. The algorithm weights those phrases positively or negatively, depending on how predictive they were of a candidate winning or losing.
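A toy version of that two-step conceptual model treats the 85% incumbency base rate as a prior log-odds and shifts it with weighted predictive phrases. The phrase weights below are hypothetical, not our fitted values:

```python
import math

def win_probability(phrase_weights, tweet_bigrams, incumbent=True):
    """Start from the ~85% incumbency base rate, then shift the
    log-odds by the weights of any predictive phrases that appear.
    phrase_weights maps bigrams to hypothetical log-odds shifts."""
    base = 0.85 if incumbent else 0.15
    logit = math.log(base / (1.0 - base))
    for b in tweet_bigrams:
        logit += phrase_weights.get(b, 0.0)
    return 1.0 / (1.0 + math.exp(-logit))
```

With no predictive phrases present, the sketch simply returns the incumbency base rate; a negatively weighted phrase like “voted hcr” pulls the incumbent’s probability down.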
How might it fail?
There are a few possible ways this could fail badly:
- 2010 was a strong year for Republicans and for challengers to incumbents. To the extent that our algorithm is biased based on these outcomes, we may over-predict either incumbent losses, or Republican victories.
- Relevant issues may change. In 2010, the Tea Party, health care, and spending were big issues. Those issues are still with us. But we also have new issues like Iran and the Euro crisis to deal with. We can’t foresee the future, so our algorithms don’t take those new issues into account in predicting who will win or lose.
Topic Models
What do those “topics” mean?
The topics are an attempt to learn from the Twitter stream what topics are most important to each Congressional district. We represent them as the top 5 terms that “best” represent the topic in each district.
Where do those “topics” come from?
We generate the topics from the same term-frequency data that we use for the election predictions. Topic models are a class of machine learning models that try to infer similarities among texts by looking at the distribution of words in lots of documents. In our case, those documents are the sets of messages mentioning each candidate, or both candidates in a Congressional district. For a good non-technical summary of topic modeling, see this paper by computer scientist Dave Blei.
How often are the topics regenerated?
We generate the topics daily, using the last five days of messages for each district. If we just used one day’s messages, we might miss candidates or districts with low daily Twitter volumes. Alternatively, if we used all the data we had, eventually the old data would swamp the new and the topics would change very little.
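For readers who want to see the machinery, here is a bare-bones collapsed Gibbs sampler for latent Dirichlet allocation, the standard topic-model algorithm. This is purely illustrative, not our production code:

```python
import random
from collections import defaultdict

def lda_topics(docs, k=2, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed-Gibbs LDA. docs is a list of token lists.
    Returns the top-5 terms per topic."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    V = len(vocab)
    # Random initial topic assignment for every token.
    z = [[rng.randrange(k) for _ in doc] for doc in docs]
    ndk = [[0] * k for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(k)]  # topic-word counts
    nk = [0] * k                                # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # remove this token's current assignment
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # Resample its topic from the conditional distribution.
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) /
                           (nk[j] + V * beta) for j in range(k)]
                t = rng.choices(range(k), weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return [[w for w, c in sorted(nkw[j].items(), key=lambda x: -x[1])
             if c > 0][:5] for j in range(k)]
```

Run daily over each district’s five-day message window, the top-5 term lists it returns are what we display as that district’s topics.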
Do these topics actually mean anything?
Yes and no. Topic models have been very successful in doing things like categorizing magazine articles, tracking how the discussion of different scientific topics changed over time, or even helping to measure and predict who votes for what in the US Congress.
You should notice that the terms we choose to represent each topic are usually pretty coherent: topics about polling are consistently present, while more volatile topics like Libya or the party presidential conventions tend to fade in and out.
But the topics themselves don’t represent actual human judgement of what’s in the Twitter feed. Instead, they are a statistical construct, based on the probability that terms occur together often in multiple districts. They are useful because they help discover what Twitter users are talking about across hundreds of thousands of messages.
Who’s this “we”?
Mark Huberty wrote the original code and paper predicting the 2010 elections. He’s a doctoral candidate in Political Science at UC Berkeley.
Hillary Sanders is a UC Berkeley undergrad pursuing a double-major in Statistics and Environmental Economics & Policy. She did the background work preparing for the 2012 election, including the thankless task of cross-checking all the names. She also keeps the query jobs running, and generally keeps us all sane.