Predicting the NFL Using Twitter
Shiladitya Sinha, Chris Dyer, Kevin Gimpel, Noah Smith

Disclaimer
This talk will not teach you how to become a successful sports bettor.

Questions
What is the NFL and what about it are we predicting?
Why are we using Twitter?

NFL = National Football League

What about the NFL are we predicting?
● Two game outcomes are commonly bet upon:
  ○ Winner with the spread (Winner WTS)
  ○ Over-Under
● Outcomes are determined at the end of the game using numbers set by bookies before the game
● Due to transaction costs, an accuracy of 53% is needed to be profitable

Point Spread and Over Under
● Point Spread: Signed number used to determine the winner with the spread (WTS) given the game score
  ○ If home_score + spread
    ■ > away_score: Home wins WTS
    ■ < away_score: Away wins WTS
    ■ = away_score: Push
● Over-Under: Betting line for total points scored by the two teams; bettor chooses “over” or “under”
● “Push” if point differential equals point spread or total points equals over-under

Point Spread and Over Under
Atlanta Falcons win a home game against the Philadelphia Eagles 33 - 26
But did they win with the spread?

Point Spread    Outcome
< -7            Falcons lose WTS
> -7            Falcons win WTS
= -7            Push

Why use Twitter?
Twitter has been used to predict and/or explain:
● Opinion Polls
● Elections
● Spread of contagious diseases
● The stock market
● Movie revenue
● Food poisoning
● Civil unrest in Latin America and the Middle East
● Citations of scientific papers
...and the list goes on
Why not sports?

Why use Twitter to predict NFL games?
● Structure of the NFL
  ○ 32 teams spread geographically across the United States
  ○ Regular season partitioned into 17 weeks
  ○ Roughly a week between games for a given team
● Participation of fans and spectators
  ○ Discussion on social media, including Twitter
  ○ Sports betting
  ○ Wisdom of the crowd

Fan Participation

Data Gathering
● Current and historical NFL game data from NFLdata.com
● Tweets from 2010-2012 seasons via Twitter Garden Hose stream (10% of all tweets)
How to isolate relevant tweets?

Data Alignment
● Use a preset list of team hashtags to identify relevant tweets
  ○ Discard tweets containing hashtags corresponding to multiple teams
  ○ Label tweets with the team their hashtags refer to
● Pittsburgh Steelers hashtags
  ○ #steelers #pittsburghsteelers #stillers #gosteelers #gosteelersgo #letsgosteelers #gostillers #gostillersgo #letsgostillers #stillernation #stillersnation #steelernation #steelersnation

Fan Participation (Annotated)

How many Tweets?

Tweet set    Time window
Pregame      24 hours to 1 hour before upcoming game
Postgame     4 hours to 28 hours after previous game
Weekly       12 hours after previous game to 1 hour before upcoming game

Season    Pregame Tweets    Postgame Tweets    Weekly Tweets
2010      40,385            53,294             185,709
2011      130,977           147,834            524,453
2012      266,382           290,879            1,014,473

Data Alignment
● Given the set a tweet is contained in, label it with the appropriate week of the season
  ○ For pregame and weekly tweets, this is the week the upcoming game will be played
  ○ For postgame tweets, this is the week the previous game was played
The team and week labels of a tweet determine the unique game the tweet corresponds to

A Simple Task
Sanity check: Can we look at a postgame tweet and determine if the team it corresponds to won or lost?
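
The hashtag-based team labeling from the Data Alignment slide could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `TEAM_HASHTAGS` mapping, the `label_team` helper, the regular expression used to find hashtags, and the `#falcons` entry are all assumptions made for the example. Only the Steelers hashtags come from the slide, and the list is abbreviated here.

```python
import re

# Hypothetical, abbreviated hashtag lists. The Steelers tags are taken from the
# slide; the Falcons entry is an assumption added so there is a second team.
TEAM_HASHTAGS = {
    "steelers": {"#steelers", "#pittsburghsteelers", "#stillers",
                 "#gosteelers", "#steelernation"},
    "falcons": {"#falcons"},
}

# Assumed hashtag pattern: '#' followed by word characters.
HASHTAG_RE = re.compile(r"#\w+")


def label_team(tweet_text):
    """Return the one team a tweet's hashtags refer to, or None.

    Tweets with no team hashtag, or with hashtags for more than one team,
    are discarded (None), mirroring the rule on the slide.
    """
    tags = {tag.lower() for tag in HASHTAG_RE.findall(tweet_text)}
    teams = {team for team, team_tags in TEAM_HASHTAGS.items()
             if tags & team_tags}
    if len(teams) == 1:
        return next(iter(teams))
    return None
```

Under these assumptions, `label_team("Big win today! #GoSteelers")` yields `"steelers"`, while a tweet mentioning both `#steelers` and `#falcons` is dropped. The week label would be assigned separately from the tweet's timestamp and the team's schedule.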
Postgame Tweet Analysis
● Use a bag-of-words model to extract features for each tweet
  ○ In-tweet word frequency features
  ○ TF-IDF features: In-tweet word frequency / Word frequency over all postgame tweets from the week
● Classify tweets as wins for Home or Away team
  ○ For k in [1,16]:
    ■ Train classifier on all postgame tweets from 2010 to week k of the 2012 season

Postgame Tweet Analysis
● Average accuracy of 67% over 2012 season
  ○ Very simple features and parameter settings
  ○ Room for improvement
Highly weighted features (Top or bottom 30) for a Home team win:
home: win
home: victory
away: loss
home: won
home: WIN
away: lost
home: Great
away: lose
away: refs

Forecasting 2012 Season

Live Predictions
During the 2012 season, we tweeted predictions before games using @NFLOracle
We predicted 34/58 or 65.4% of winner WTS results correctly!
We did not encounter our tweets in the Twitter Garden Hose stream

Training and Testing
Predict outcomes of upcoming games using Logistic Regression:
Seasons 2010, 2011 and Weeks [1, k-3] of 2012: Train on all games, with the exception of the last week of a season.
Weeks [k-2, k-1] of 2012: Tune L1/L2 regularization parameters.
Week k of 2012: Predict and Test
Apply procedure over weeks 3-16 of 2012 (or current season)

Predicting the NFL without using Twitter
● Use historical game data to create simple feature sets
  ○ Point spreads, over-unders, scoring, etc.
● Combine pairs of simple features to get a preliminary list of game-based feature sets
  ○ Highest average accuracy of 56%
  ○ Serve as a baseline for Twitter-derived feature sets

Simple features derived from tweets
● Word-level features (Twitter unigrams)
  ○ Use only words that appear in at least 0.1% of all tweets
  ○ Use log(word frequency + 1) over all weekly tweets corresponding to the game
● Creates a high-dimensional feature space (~20k words)
  ○ This is ‘high’ due to the small set of games
● How to combine these features with features generated from historical game data?
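
The unigram features described above can be illustrated with a short sketch. It is not the authors' implementation: the function names `build_vocabulary` and `game_features`, the lowercase whitespace tokenization, and the reading of "word frequency" as a raw count over a game's weekly tweets are all assumptions made for this example.

```python
import math
from collections import Counter


def build_vocabulary(all_tweets, min_fraction=0.001):
    """Keep words that appear in at least min_fraction (0.1%) of all tweets.

    Tokenization is assumed to be lowercase whitespace splitting.
    """
    doc_freq = Counter()
    for tweet in all_tweets:
        # Count each word at most once per tweet.
        doc_freq.update(set(tweet.lower().split()))
    threshold = min_fraction * len(all_tweets)
    return sorted(word for word, n in doc_freq.items() if n >= threshold)


def game_features(weekly_tweets, vocabulary):
    """Return log(count + 1) for each vocabulary word over one game's weekly tweets."""
    counts = Counter()
    for tweet in weekly_tweets:
        counts.update(tweet.lower().split())
    return [math.log(counts[word] + 1) for word in vocabulary]
```

With roughly 20k vocabulary words and only a few hundred games, each game's feature vector has far more dimensions than there are training examples, which is why the next slides turn to dimensionality reduction.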
Dimensionality Reduction
● Canonical Correlation Analysis
  ○ Dimensionality Reduction
  ○ Combining of multiple data streams
Results of applying CCA to Twitter unigrams and game statistics features:

Feature Set          Winner WTS Accuracy (%)
Twitter Unigrams     47.6
1 component CCA      50.4
2 component CCA      51.0
4 component CCA      51.9
8 component CCA      48.1

The Rate Feature
● Measure the difference in volume of a team’s tweets across consecutive weeks
  ○ Easy to compute
  ○ Considers tweets collectively rather than individually
  ○ Doesn’t use point spread or any game statistics
Highest average accuracy using a rate feature set is 56%

Results

Conclusion
Our results suggest that a social media driven approach can be effective in predicting sporting events.
Download our dataset of NFL game outcomes and tweet IDs!
www.ark.cs.cmu.edu/football/
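
As a rough illustration of the rate feature from the talk, the change in a team's tweet volume between consecutive weeks might be computed as below. The slides do not give the exact formula, so both the absolute difference and the optional relative change are assumptions, as are the `volume_change` name and the week-to-count dictionary layout.

```python
def volume_change(weekly_counts, week, relative=False):
    """Change in a team's tweet volume from the previous week to `week`.

    weekly_counts maps week number -> tweet count for one team (assumed layout).
    Returns None when either week is missing, or when a relative change is
    requested and the previous week had no tweets.
    """
    current = weekly_counts.get(week)
    previous = weekly_counts.get(week - 1)
    if current is None or previous is None:
        return None
    difference = current - previous
    if not relative:
        return difference
    if previous == 0:
        return None
    return difference / previous
```

For example, with counts of 100 tweets in week 1 and 150 in week 2, the absolute change for week 2 is 50 and the relative change is 0.5. A game-level feature could then compare the home and away teams' values, though the talk does not say how the two are combined.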